Continuous Authentication Using One-class Classifiers and their Fusion

Rajesh Kumar
Syracuse University
New York, USA
   Partha Pratim Kundu
Nanyang Technological University
   Vir V. Phoha
Syracuse University
New York, USA

While developing continuous authentication systems, we generally assume that samples from both genuine and impostor classes are readily available. However, the assumption may not be true in certain circumstances. Therefore, we explore the possibility of implementing continuous authentication systems using only genuine samples. Specifically, we investigate the usefulness of four one-class classifiers (elliptic envelope, isolation forest, local outlier factor, and one-class support vector machines) and their fusion. The performance of these classifiers was evaluated on four distinct behavioral biometric datasets and compared with eight multi-class classifiers. The results demonstrate that if we have sufficient training data from the genuine user, the one-class classifiers and their fusion can closely match the performance of the majority of the multi-class classifiers. Our findings encourage the research community to use one-class classifiers to build continuous authentication systems, as they do not require knowledge of the impostor class during the enrollment process.

2018 IEEE International Conference on Identity, Security, and Behavior Analysis (ISBA)

978-1-5386-2248-3/18/$31.00  ©2018 IEEE

1 Introduction

Multi-class classifiers have been widely studied for building behavioral-biometric-based continuous authentication systems [1, 2, 3, 4]. The disadvantage of using multi-class classifiers is that they require samples from both genuine and impostor classes to determine the respective decision boundaries. In other words, to build a biometric system based on multi-class classifiers, both genuine and impostor samples must be collected [5]. For training and testing the classification models, researchers have conventionally used samples from users other than the genuine one as impostors [6, 1, 7]. The assumption that impostor samples (samples from the other users) are readily available for training might not be realistic in certain circumstances [8, 5, 9, 10]. For example, (1) individuals might refuse to consent to the use of their biometric data for building authentication systems for someone else, (2) governments might restrict the use of one person's data for building systems for others, and (3) collecting good-quality impostor samples is difficult, since we may have to reveal a great deal of information about the genuine user, which could cause privacy concerns [11].

Moreover, Bicego et al. [12] have demonstrated that the choice of the impostor heavily impacts the performance of authentication systems based on multi-class classifiers. In addition, samples from individuals who were not part of the training database may appear at any time during verification. This could happen in any realistic (especially adversarial) setup where continuous authentication is applicable. Various studies have advised that even if impostor samples are available, authentication systems should preferably be built using only genuine samples [9, 13]. Thus, we explored the possibility of implementing continuous authentication systems by employing genuine samples only. Specifically, we implemented continuous authentication systems using four different one-class classifiers and their fusion, and tested their performance on four distinct behavioral biometric datasets. One-class classifiers are popular and have been extensively studied for outlier or novelty detection [14, 15], as well as in physiological biometric-based recognition systems [12]. However, the practicability of one-class classifiers has rarely been explored in the context of motion-sensor-based continuous authentication systems [16, 9].

Ding and Ross applied an ensemble of one-class SVMs for detecting spoofing attacks on fingerprint recognition systems [17]. The ensemble was able to address the problem of insufficient (or unseen) spoof samples encountered by conventional spoof detection algorithms. However, one-class classifiers have rarely been explored on motion- or touch-sensor-based biometric datasets. Since these datasets are different in nature and have high variability, it is worth investigating how one-class classifiers would perform on them, especially in the context of continuous authentication. To this end, we hypothesized that if we have a sufficient amount of genuine data for training, one-class classifiers could closely match the performance of multi-class classifiers. The main contributions of this paper are summarized below:

  • We investigate four one-class classification (novelty or outlier detection) algorithms and their fusion in the context of motion- and touch-gesture-based continuous authentication systems. To the best of our knowledge, three of these classifiers have not been studied in the continuous authentication domain before.

  • The performance of the one-class classifiers was compared with eight well-established multi-class classifiers across four distinct behaviometric datasets using False Accept Rate (FAR), False Reject Rate (FRR), Half Total Error Rate (HTER), and Area Under the Curve (AUC). A series of statistical tests was conducted to compare the top-performing classifiers from both the one-class and multi-class groups.

  • The challenges of implementing continuous authentication systems using one-class classifiers, including dynamic score normalization in the absence of the impostor score distribution, are also discussed.

The rest of the paper is organized as follows: Section 2 presents related work; Section 3 describes the experimental setup; Section 4 discusses the performance evaluation methods and metrics, and Section 5 concludes the work.

2 Related Work

One-class classifiers, especially one-class support vector machines, have been applied to solve a variety of authentication problems. Examples include face recognition [12], typist recognition [9], smart-stroke [8], touch [5], and mouse dynamics [16]. Antal and Szabó [5] used four one-class classifiers (the Parzen density estimator, the k nearest-neighbor (kNN) method, the Gaussian mixtures method, and the Support Vector Data Description method) to build authentication systems based on swipe gestures; however, it was unclear whether their authentication framework was one-time or continuous. Moreover, the swipe gestures and micro-movements of the device were collected in a constrained environment under a very specific scenario, namely while responding to a psychological questionnaire. Hence, the kNN and Parzen density estimator achieved mean Equal Error Rates (EER) as low as 0.024 and 0.023 after combining the decisions from successive swipe gestures.

Antal and Szabó [8] also compared one-class and two-class classifiers in the context of keystroke-based authentication on mobile devices and reported a difference of 4% in error rate between the two groups. Hempstalk et al. [9] combined density and class probability estimation to improve the classification performance. They also conducted experiments using artificially generated impostor samples and posed the question of how the quality and quantity of artificially generated impostor samples may affect the overall performance. Shen et al. [16] applied SVM-, Neural Network-, and kNN-based one-class classifiers to mouse-usage patterns. They report the Half Total Error Rate (HTER) of each classifier on a dataset of 5550 mouse-operation samples collected from 37 subjects. They also strongly argued that one-class methods are more suitable for user authentication in real-world applications.

However, none of the above papers studied the classifiers that are studied in this paper, with the exception of one-class support vector machines. To the best of our knowledge, one-class classifiers have not been evaluated across distinct behavioral biometric datasets before. Moreover, we explore fusion at the score and decision levels and discuss the challenges that we faced due to the continuous authentication paradigm.

3 Design of Experiments

3.1 Continuous Authentication

Continuous authentication is a process in which users are unobtrusively monitored at frequent intervals throughout their interaction with any device or system [18]. Continuous authentication systems pose different challenges compared to one-time (or login-time) authentication systems. Examples include the availability of data throughout the user interaction, high intra-user variance, authentication accuracy, and resource consumption. At the same time, continuous authentication systems do not have to be as accurate as login-time authentication systems, because the verification happens at quite frequent intervals and users can be locked out after a certain number of successive rejects. To implement the continuous part of the system, a sliding-window-based mechanism is generally used [6, 2, 19]. The authentication decisions are based either on the patterns captured in the current window or on those in the last few windows. We followed the window-based feature extraction strategy for all four datasets that were studied. The preprocessing, window size, sliding intervals, and set of features are kept exactly as advised in the works that originally proposed the corresponding dataset.
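The window-based monitoring and the lock-out after successive rejects described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of any cited work: the function names, the window and step sizes, and the `max_rejects` limit are hypothetical.

```python
from typing import Iterator, Sequence


def sliding_windows(samples: Sequence[float], window: int, step: int) -> Iterator[Sequence[float]]:
    """Yield consecutive windows of `window` samples, advancing by `step` samples."""
    for start in range(0, len(samples) - window + 1, step):
        yield samples[start:start + window]


def lock_out_index(decisions: Sequence[bool], max_rejects: int) -> int:
    """Return the index of the window at which the user is locked out, or -1.

    decisions[i] is True when window i was accepted as genuine.
    The user is locked out after `max_rejects` successive rejects (hypothetical policy).
    """
    streak = 0
    for i, accepted in enumerate(decisions):
        streak = 0 if accepted else streak + 1
        if streak >= max_rejects:
            return i
    return -1


# Ten readings, windows of four readings, sliding by two readings.
windows = [list(w) for w in sliding_windows(list(range(10)), window=4, step=2)]
```

Each window would then be turned into one feature vector and passed to the classifier; a single reject does not lock the user out, only a run of them does.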

3.2 Datasets and Feature Analysis

We used four distinct behavioral datasets that included phone-accelerometer-based gait patterns, watch-accelerometer-based gait, watch-gyroscope-based gait, and a fusion of swiping and phone movement patterns. These datasets were built with the aim of replicating a realistic environment. The training and testing data were collected in separate sessions. The specific details of each dataset are provided below.

3.2.1 Phone Acceleration-based Gait Biometric

This dataset consists of walking patterns collected through a smartphone accelerometer from 18 users who were faculty, staff, or students [19]. Android's TYPE_LINEAR_ACCELERATION sensor was used, which records accelerations (with no gravity component) in the sensor's own frame of reference. The data was collected in two separate sessions, two to three days apart, referred to as training and testing. The participants walked back and forth freely for about 200 meters, keeping an HTC One M8 smartphone in their pants pocket. The sampling rate of the accelerometer was set to normal, which produced around 46 samples per second. The steps of data preprocessing, feature analysis, and generation of genuine and impostor samples for training and testing were replicated exactly as advised in [19]. This dataset was the smallest in terms of volume, as the average number of samples per user per session was 16. This dataset is referred to as Phone Acceleration-based Gait (PABG) in the rest of the paper.

3.2.2 Smartwatch-based Gait Biometric

This dataset contains arm movement patterns of 40 users. The data was collected using the motion sensors (accelerometer and gyroscope) built into a Samsung Galaxy Gear S. The sampling rate for both sensors was kept at 25 Hz. Thirty-four participants were between 20 and 30 years of age, four were between 30 and 35, and the rest were in their 50s. Ten of the subjects were female, while the rest were male. The data was collected during a walk of about 2-3 minutes with the watch worn on the wrist. With 10-second windows sliding at 5-second intervals, the average number of samples per user per session turned out to be 18.4. We replicated the data preprocessing and feature extraction steps as proposed in [6], and used the selected features as advised for creating genuine and impostor samples for training and testing the classifiers. The dataset was divided into two parts based on the type of sensor used for recording the patterns. The arm acceleration patterns recorded through the accelerometer will be referred to as Smartwatch Acceleration-based Gait (WABG). Similarly, the arm rotation patterns recorded through the smartwatch gyroscope will be referred to as Smartwatch Rotation-based Gait (WRBG) in the rest of this paper.

(a) Decision boundaries of the four one-class classifiers on artificially generated Gaussian (uni-modal) data.
(b) Decision boundaries of the four one-class classifiers on artificially generated multi-modal data.
Figure 1: Illustration of the working philosophies of all four one-class classifiers that were studied.

3.2.3 Swiping and Phone Movement Patterns

This dataset consists of swiping patterns along with the corresponding underlying phone movement patterns, continuously collected through the accelerometer while participants browsed specifically designed web pages as well as pages of their choice and answered a set of questions. The data was collected from 28 volunteers in a completely realistic and unconstrained environment over four to seven days. We replicated the data preprocessing and feature extraction steps exactly as proposed by Kumar et al. [2], and used the selected features as advised for creating genuine and impostor samples. The average number of samples per user was 55.35 and 55.82 for the training and testing sessions, respectively. This dataset was the biggest in terms of volume. We will refer to this dataset as Swiping and Phone Movement Patterns (SPMP) in the rest of the paper. For this dataset, we replicated only one of the experiments, i.e., a feature-level fusion of swiping and phone movements with a single-template framework, as presented by Kumar et al. [2].

3.3 Choice of Classifiers

For each dataset, we replicated the experimental setup advised by the respective authors in their papers [19, 6, 2]. For the PABG dataset, the authors studied five classifiers: Bayes Network, Logistic Regression, Multilayer Perceptrons, Random Forests, and Support Vector Machines [19]. Similarly, four classifiers (k Nearest Neighbors, Logistic Regression, Multilayer Perceptrons, and Random Forests) were used to implement continuous authentication on WABG and WRBG [6], and two classifiers (kNN with Euclidean distance, and Random Forests) were used to implement continuous authentication for SPMP [2]. It is generally difficult to know in advance which classifier fits a dataset better. So, we decided to study eight multi-class classifiers that covered most of the classifiers which were applied to the above datasets in the past. We used Python's sklearn package [20] for running these classifiers. The parameter settings of these algorithms were calibrated to get the best possible performance.

The multi-class classifiers are well studied in the authentication domain and are not the focus of this study; therefore, we do not provide any details on how they work. However, the one-class classifiers that we studied have rarely been explored in this domain (except the one-class Support Vector Machine), so we briefly discuss their working philosophies in the following paragraphs. One-class classifiers have been successfully applied to solve a variety of one-class problems in the past. The most widely known and established one is the one-class Support Vector Machine (SV1C) [21]. Hence, SV1C was an intentional choice in this study. In addition to SV1C, we study Elliptic Envelope (EE), Isolation Forest (IF), and Local Outlier Factor (LOF). The reason behind choosing these algorithms was their different working philosophies and distinct decision-making capabilities.

The one-class SVM (SV1C) is an unsupervised method that learns a decision function from the samples supplied to it. The decision function is the result of a process that separates all the data points from the origin and maximizes the distance from the hyperplane to the origin. SV1C has two important parameters: nu, which determines the upper bound on the fraction of outliers and allows control of the trade-off between genuine (normal) and impostor (abnormal) predictions, and tol, a value used as the stopping tolerance that affects the number of iterations for optimizing the model. We standardized all the features using the StandardScaler of Python's sklearn, which standardizes features by removing the mean and scaling to unit variance.

During enrollment, only genuine samples were supplied to SV1C, while during verification, both genuine and impostor samples were tested. SV1C made the decision based on the learned decision function. In addition to the normal/abnormal decisions, SV1C returned scores associated with the predictions. We used these scores later for score-level fusion.
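A minimal sketch of this enrollment and verification flow with sklearn's `OneClassSVM`, assuming synthetic Gaussian data in place of real biometric features. The values of `nu`, `tol`, the kernel, and the data shapes are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for feature vectors (assumption: 5 features per window).
genuine_train = rng.normal(loc=0.0, scale=1.0, size=(60, 5))
genuine_test = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
impostor_test = rng.normal(loc=4.0, scale=1.0, size=(20, 5))

# Enrollment: standardize the features, then fit on genuine samples only.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.1, tol=1e-3),
)
model.fit(genuine_train)

# Verification: signed scores (higher means more genuine-like) and hard decisions.
genuine_scores = model.decision_function(genuine_test)
impostor_scores = model.decision_function(impostor_test)
impostor_decisions = model.predict(impostor_test)  # +1 = genuine, -1 = impostor
```

The scores from `decision_function` are the kind of output that can later be normalized and combined in score-level fusion.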

The Elliptic Envelope (EE) [22] is a simple outlier detector that assumes the data follow a Gaussian distribution. It fits a robust covariance estimate to the supplied data. In other words, it fits an ellipse to the central sample points. The Minimum Covariance Determinant, a robust estimator of covariance, was used. The Mahalanobis distances obtained from this estimate were used to derive a measure of abnormality. As EE is very sensitive to the feature dimensions, we applied Principal Component Analysis and supplied only the top-ranked 30% of the principal components to EE. Unlike EE, SV1C does not assume any parametric form of the data distribution and therefore models the complex shape of the data much better in general.
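The combination of dimensionality reduction and robust covariance fitting described above might look like the following sketch. The synthetic data, the `contamination` value, and the way the 30% is rounded to a component count are assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_features = 20
mixing = rng.normal(size=(3, n_features))  # hypothetical low-dimensional structure


def make_samples(n, latent_mean):
    """Synthetic correlated features: a 3-D latent signal plus noise."""
    latent = rng.normal(loc=latent_mean, size=(n, 3))
    return latent @ mixing + rng.normal(scale=0.5, size=(n, n_features))


genuine_train = make_samples(200, 0.0)
genuine_test = make_samples(30, 0.0)
impostor_test = make_samples(30, 4.0)

# Keep the top-ranked 30% of the principal components (6 of 20 here).
n_keep = max(1, int(round(0.3 * n_features)))

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=n_keep),
    EllipticEnvelope(contamination=0.1, random_state=0),  # MCD-based robust fit
)
model.fit(genuine_train)

# Higher score means closer to the fitted ellipse (smaller Mahalanobis distance).
genuine_scores = model.decision_function(genuine_test)
impostor_scores = model.decision_function(impostor_test)
```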

Generally, outlier detection in high-dimensional space is very challenging. The Isolation Forest (IF) [23], however, does a decent job in such scenarios compared to other algorithms, e.g., EE, which assumes a certain underlying distribution (see the rightmost figure of Figure 1(b)). It isolates samples by randomly selecting a feature and then randomly selecting a split point between the maximum and minimum values of the selected feature. This process is repeated recursively and is represented by a tree structure. The number of partitions required to isolate a sample is the path length from the root node to the terminating node. The path length, averaged over a forest of such random trees, is used as the measure for making the final decision. The shorter the path, the greater the abnormality.
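The path-length idea can be observed with sklearn's `IsolationForest`, whose `score_samples` output decreases as the average path length gets shorter. The data and the number of trees below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
genuine_train = rng.normal(size=(200, 10))  # synthetic genuine feature vectors

forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(genuine_train)

typical = np.zeros((1, 10))       # sits in the middle of the training data
unusual = np.full((1, 10), 6.0)   # far from the training data on every feature

# Lower score = shorter average path = more abnormal.
typical_score = forest.score_samples(typical)[0]
unusual_score = forest.score_samples(unusual)[0]
```

A sample that lies far outside the training range tends to be separated by the first few random splits, which is why its score is lower.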

The Local Outlier Factor (LOF) [24] is an unsupervised outlier detector which computes the local density deviation of a given sample with respect to its neighbors. The local density is estimated by the typical distance at which a point can be reached from its neighbors. Samples that have a substantially lower density than their neighbors are considered outliers. The number of neighbors is an important parameter and is generally kept greater than the minimum number of samples that a cluster contains, and smaller than the maximum number of nearby samples that could be potential impostors.
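A short sketch of density-based scoring with sklearn's `LocalOutlierFactor`. It assumes a recent sklearn release in which `novelty=True` allows scoring unseen samples; the data and `n_neighbors=10` are illustrative choices only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
genuine_train = rng.normal(size=(100, 2))  # synthetic genuine feature vectors

# novelty=True fits on genuine data and permits scoring new samples afterwards.
lof = LocalOutlierFactor(n_neighbors=10, novelty=True)
lof.fit(genuine_train)

near = np.array([[0.0, 0.0]])  # inside the dense region
far = np.array([[8.0, 8.0]])   # in a region with much lower local density

near_score = lof.decision_function(near)[0]
far_score = lof.decision_function(far)[0]
far_label = lof.predict(far)[0]  # +1 = genuine, -1 = impostor
```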

Figure 2: Heatmap of Pearson correlation coefficients computed among the scores predicted by the one-class classifiers in order to gauge the usefulness of the fusion. The lower the correlation coefficient, the more useful the fusion might be [15].

If the genuine samples form a well-centered elliptical cluster and/or follow a Gaussian distribution, a decision rule based on fitting a covariance estimate, like EE, would be able to generate a well-separated decision boundary around the genuine samples. On the contrary, if the genuine samples do not follow any such underlying distribution, the other one-class classifiers may perform well (see Figure 1(a)). Moreover, if the genuine samples are non-Gaussian or multi-modal, EE fails to produce a usable decision boundary for them, whereas the other classifiers might be able to generate a reasonable decision boundary (see Figure 1(b)) [25].

3.4 Training and Testing of the Classifiers

Although one-class classifiers offer one of the biggest advantages over multi-class classifiers, i.e., they do not require samples from the abnormal (impostor) class at all, they require sufficient training data from the normal (genuine) class to draw an accurate classification boundary. This phenomenon was observed while setting up the experimental parameters for training verification models as well as user-specific thresholds for authentication systems.

The multi-class classifiers required both genuine and impostor samples during training, while the one-class classifiers could be trained using genuine samples only. We tested all (one-class and multi-class) classifiers for genuine pass/fail rates using genuine samples and for impostor pass/fail rates using impostor samples. For impostor testing, we borrowed a fixed number of samples from users other than the genuine one, following the suggestions of Kumar et al. [2, 19, 6].

Classifier | PABG                       | WABG                       | WRBG                       | SPMP
           |   FAR    FRR   HTER    AUC |   FAR    FRR   HTER    AUC |   FAR    FRR   HTER    AUC |   FAR    FRR   HTER    AUC
ABoost     |  4.58  20.27  12.42  87.58 |  2.24  24.34  13.29  86.71 |  2.56  25.38  13.97  86.03 |  8.33  13.41  10.87  89.13
NBayes     |  1.96  27.86  14.91  85.09 |  1.28  25.47  13.38  86.62 |  2.88  28.76  15.82  84.18 |  9.79  11.04  10.42  89.58
kNN        | 13.07   1.48   7.28  92.72 |  7.56   3.99   5.78  94.22 |  7.95   8.71   8.33  91.67 | 14.02   3.87   8.94  91.06
LDA        | 10.78   3.81   7.30  92.70 |  6.47   8.69   7.58  92.42 |  6.28  12.61   9.45  90.55 | 18.52   5.38  11.95  88.05
LReg       |  6.21   7.09   6.65  93.35 |  4.62  17.33  10.97  89.03 |  6.47  23.41  14.94  85.06 | 12.04   7.91   9.97  90.03
MLP        |  7.52   7.61   7.56  92.44 |  3.40  18.75  11.07  88.93 |  4.17  22.17  13.17  86.83 | 15.08   6.72  10.90  89.10
RFC        |  2.29  14.62   8.45  91.55 |  0.58  26.58  13.58  86.42 |  1.47  25.66  13.57  86.43 |  7.80  12.06   9.93  90.07
SVC        | 14.05   2.50   8.28  91.72 |  3.21  10.03   6.62  93.38 |  4.49  15.10   9.79  90.21 | 18.12   3.04  10.58  89.42
SV1C       |  7.03  14.65  10.84  89.16 |  9.01  13.01  11.01  88.99 | 11.83  16.45  14.14  85.86 | 11.71   9.48  10.59  89.41
LOF        |  6.70  17.78  12.24  87.76 | 18.46  11.72  15.09  84.91 | 18.75  11.86  15.31  84.69 | 12.83  10.89  11.86  88.14
IF         | 17.16  25.15  21.15  78.85 | 16.25  22.35  19.30  80.70 | 12.56  21.28  16.92  83.08 | 15.15  24.56  19.85  80.15
EE         | 14.05  27.43  20.74  79.26 | 19.26  14.38  16.82  83.18 | 23.62  20.14  21.88  78.12 | 15.28  15.54  15.41  84.59
Fusion*    |  8.17  13.61  10.89  89.11 |  7.37  17.29  12.33  87.67 | 10.58  17.83  14.20  85.80 | 11.51   9.24  10.37  89.63
Table 1: The performance of the multi-class classifiers, the one-class classifiers, and the best fusion of one-class classifiers on the PABG, WABG, WRBG, and SPMP datasets. The first eight rows (excluding the header) present the average False Accept Rate (FAR), False Reject Rate (FRR), Half Total Error Rate (HTER), and Area Under the Curve (AUC) obtained by the multi-class classifiers on the different datasets. The next four rows present the same metrics obtained by the four individual one-class classifiers. Notably, the performance of individual one-class classifiers, especially SV1C and LOF, is comparable to that of most of the multi-class classifiers, except the top three, for the SPMP dataset, which had a good amount of genuine samples (on average 55 per user) for training. The performance of the other two one-class classifiers (IF and EE) is poor compared to the top four multi-class classifiers. Although the top two one-class classifiers are not the best performers, they are still better than half of the multi-class classifiers across all four datasets. Fusion* represents the combination of classifiers that achieved the best error rates. The combination for PABG and WABG was LOF+SV1C, whereas for WRBG and SPMP it was IF+LOF+SV1C and EE+LOF+SV1C, respectively.

3.5 Fusion of One-class Classifiers

We explored the fusion of the one-class classifiers because they work on different philosophies and create different decision boundaries (see Figure 1), and because the scores obtained by the classifiers are relatively weakly correlated (see Figure 2). The fusion of two classifiers could enhance the performance when their scores or decisions are uncorrelated [26, 15]. There exist several other methods to measure the diversity of classifiers for gauging the usefulness of fusion, and their relationship to the overall performance is discussed by Kuncheva et al. [27]. In the future, we aim to explore more one-class classifiers and their usefulness in fusion to enhance the overall performance of the system.
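The correlation check behind Figure 2 can be sketched with NumPy. The scores below are synthetic and deliberately constructed so that two classifiers agree strongly while the other two are independent; they are not the scores from the experiments.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 50
shared = rng.normal(size=n_samples)

# Hypothetical per-sample scores from the four one-class classifiers.
scores = {
    "SV1C": shared + 0.3 * rng.normal(size=n_samples),
    "LOF": shared + 0.3 * rng.normal(size=n_samples),
    "IF": rng.normal(size=n_samples),
    "EE": rng.normal(size=n_samples),
}
names = list(scores)

# Rows are classifiers, so corr[i, j] is the Pearson correlation between
# the scores of names[i] and names[j].
corr = np.corrcoef(np.vstack([scores[name] for name in names]))
```

Pairs with a low absolute correlation are the ones for which fusion is more likely to add information.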

In our experimental setup, the fusion of the one-class classifiers was feasible at both the score and decision levels. The decision-level fusion could not improve the performance of the overall system. Further, we explored the option of training a classifier to fuse the decisions from all four one-class classifiers. However, the number of decision samples was too low to train a classifier for PABG, WABG, and WRBG. So we trained the fusion classifier for SPMP but observed no improvement in performance compared to the individual one-class classifiers. Similarly, we also explored the option of training a classifier using the scores obtained from all four one-class classifiers to carry out score-level fusion, but we observed only minor improvements in the overall performance.

One of the biggest challenges that we faced while fusing the scores was normalizing the scores obtained by different classifiers to the same scale. In the case of multi-class classifiers, min-max normalization has been an established solution, especially when the score distribution is unknown. Essentially, we need to know both the genuine and impostor scores to compute the min and max; however, in the case of one-class classifiers we only know the genuine scores. To generate impostor scores, one could collect some impostor data; however, that would be against what we are establishing through this paper, i.e., implementing a continuous authentication system using only genuine samples. In our experiment, we used a logistic function whose parameter was decided based on the genuine scores obtained by running a validation on the training set itself; the function maps each original score to a normalized score. We also tested the tanh and soft-sign functions, which worked well too. The combined score was evaluated against a threshold (derived from the genuine scores obtained on the training data) to make the final decision.
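A possible shape for this normalization and fusion step is sketched below. The exact logistic form is not preserved in this text, so the form s' = 1 / (1 + exp(-k s)), the symbol k, the rule for choosing k from genuine scores, the mean-based fusion, and the percentile threshold are all assumptions made for illustration.

```python
import numpy as np


def logistic_normalize(scores, k):
    """Map raw scores to (0, 1) with a logistic function of steepness k (assumed form)."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-k * scores))


def steepness_from_genuine(genuine_scores):
    """Hypothetical rule: use the inverse spread of the genuine validation scores."""
    spread = float(np.std(genuine_scores))
    return 1.0 / spread if spread > 0 else 1.0


def fuse(score_lists, ks):
    """Average the normalized scores of several classifiers, sample by sample."""
    normalized = [logistic_normalize(s, k) for s, k in zip(score_lists, ks)]
    return np.mean(normalized, axis=0)


def threshold_from_genuine(fused_genuine_scores, percentile=5.0):
    """Hypothetical threshold: a low percentile of the fused genuine training scores."""
    return float(np.percentile(fused_genuine_scores, percentile))


# Example with made-up genuine validation scores from two classifiers.
genuine_a = np.array([0.8, 1.1, 0.9, 1.3, 0.7])
genuine_b = np.array([2.0, 2.5, 1.5, 3.0, 2.2])
ks = [steepness_from_genuine(genuine_a), steepness_from_genuine(genuine_b)]
fused_genuine = fuse([genuine_a, genuine_b], ks)
threshold = threshold_from_genuine(fused_genuine)
```

Only genuine scores enter the choice of k and of the threshold, which keeps the sketch consistent with the genuine-only constraint discussed above.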

4 Performance Evaluation

The performance of all classifiers was evaluated on four different datasets using False Accept Rate (FAR), False Reject Rate (FRR), Half Total Error Rate (HTER), and Area Under the Curve (AUC) [28, 16, 29]. HTER is defined as the average of FAR and FRR. The important hyperparameters of most of the classifiers were calibrated to achieve the best possible error rates. To understand the operational characteristics of the systems based on one-class classifiers and on their fusion, we plotted the Detection Error Trade-off (DET) curve [30, 31] (see Figure 3). These curves were plotted by varying the decision threshold on the scores and computing the mean FAR and FRR for each threshold across the user population of the respective datasets. We observed that SV1C, LOF, and the fusion performed well. We can also observe that the curve is smooth for SPMP, which had the largest number of training samples among all datasets.
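The error measures can be computed from scores and a decision threshold as in the sketch below. The convention that a sample is accepted when its score is at least the threshold is an assumption, and the example scores are made up.

```python
import numpy as np


def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR, HTER) in percent for one decision threshold.

    A sample is accepted when its score is >= threshold (assumed convention).
    """
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    far = 100.0 * np.mean(impostor_scores >= threshold)  # impostors accepted
    frr = 100.0 * np.mean(genuine_scores < threshold)    # genuine users rejected
    hter = (far + frr) / 2.0                              # average of FAR and FRR
    return far, frr, hter


far, frr, hter = error_rates(
    genuine_scores=[0.9, 0.8, 0.4, 0.7],
    impostor_scores=[0.1, 0.6, 0.3, 0.2],
    threshold=0.5,
)

# Sweeping the threshold yields the (FAR, FRR) pairs that make up a DET curve.
det_points = [
    error_rates([0.9, 0.8, 0.4, 0.7], [0.1, 0.6, 0.3, 0.2], t)[:2]
    for t in np.linspace(0.0, 1.0, 11)
]
```

As a check against Table 1, a FAR of 4.58 and an FRR of 20.27 average to 12.425, which matches the HTER of 12.42 reported for ABoost in the first dataset column.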

Figure 3: Illustration of the trade-off between two error measures, False Accept Rate (FAR) and False Reject Rate (FRR). These curves were plotted by varying the decision threshold on the scores obtained by the one-class classifiers and the best fusion, and computing the mean FAR and FRR at every threshold across the user population of the respective datasets. The curve for SPMP looks quite consistent and smooth because we had a good amount of training data (56 samples per user) compared to the other datasets, where we had around 18 samples per user. Another crucial observation is that a single classifier dominated the fusion heavily.
PABG                                    | WABG                                    | WRBG                                    | SPMP
CL Pair       KST      Wilc.    Fried.  | CL Pair       KST      Wilc.    Fried.  | CL Pair       KST      Wilc.    Fried.  | CL Pair       KST      Wilc.    Fried.
KNN-1CLOF     6.8e-04  9.7e-01  4.4e-01 | KNN-1CLOF     4.4e-08  4.5e-03  3.3e-01 | KNN-1CLOF     6.7e-08  1.8e-02  8.7e-01 | NBayes-1CLOF  5.4e-05  5.1e-01  5.5e-01
LDA-1CLOF     1.5e-03  6.7e-01  2.0e-01 | LDA-1CLOF     9.3e-07  1.2e-01  6.2e-01 | LDA-1CLOF     6.3e-07  1.3e-01  2.5e-01 | KNN-1CLOF     1.8e-05  4.8e-01  2.6e-01
LR-1CLOF      2.3e-03  6.4e-01  7.6e-01 | LR-1CLOF      8.9e-06  7.2e-01  8.7e-01 | MLP-1CLOF     1.1e-06  7.7e-01  1.3e-01 | LR-1CLOF      8.3e-05  4.1e-01  4.5e-01
MLP-1CLOF     1.0e-03  8.9e-01  1.1e-01 | SVC-1CLOF     9.5e-07  6.0e-03  4.6e-02 | SVC-1CLOF     3.4e-06  6.5e-02  4.1e-01 | RFC-1CLOF     4.8e-05  7.4e-01  8.5e-01
KNN-1CSVM     6.8e-04  5.1e-01  7.1e-02 | KNN-1CSVM     9.8e-08  8.5e-02  1.0e+00 | KNN-1CSVM     6.7e-08  5.8e-02  8.7e-01 | NBayes-1CSVM  3.5e-06  5.9e-03  1.7e-03
LDA-1CSVM     2.4e-03  5.5e-01  2.0e-01 | LDA-1CSVM     2.5e-06  4.0e-01  6.2e-01 | LDA-1CSVM     6.3e-07  2.1e-01  6.2e-01 | KNN-1CSVM     1.0e-05  3.5e-03  2.5e-03
LR-1CSVM      2.3e-03  7.6e-01  1.3e-01 | LR-1CSVM      8.9e-06  9.0e-01  4.0e-01 | MLP-1CSVM     1.1e-06  8.7e-01  1.3e-01 | LR-1CSVM      5.5e-06  1.3e-03  2.6e-04
MLP-1CSVM     1.0e-03  8.0e-01  1.1e-01 | SVC-1CSVM     9.5e-07  5.9e-02  3.0e-01 | SVC-1CSVM     3.4e-06  9.0e-02  8.7e-01 | RFC-1CSVM     4.9e-06  2.2e-02  1.2e-01
Table 2: The p-values obtained from the statistical tests conducted to evaluate the significance of the difference among the error rates (HTER) obtained by the top four multi-class classifiers and the top two one-class classifiers for each dataset. Note that the Friedman test, Wilcoxon signed-rank test, Kolmogorov-Smirnov test, and Classification Pair are abbreviated as Fried., Wilc., KST, and CL Pair in this table.

For comparing the performance of the one-class classifiers with the multi-class classifiers, we evaluated a total of eight multi-class classifiers: AdaBoost (ABoost), Naive Bayes (NBayes), k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), Logistic Regression (LReg), Multilayer Perceptron (MLP), Random Forest (RFC), and Support Vector Classification (SVC) over all datasets. The performance of these algorithms is reported in Table 1. The best-performing multi-class classifiers differed by dataset. We can see that SV1C and LOF outperformed at least four multi-class classifiers across all datasets. All possible (twelve) combinations of the four one-class classifiers were evaluated. A single classifier dominated across the combinations and datasets. The HTER of the best fusion was comparable to that of the best individual one-class classifier (see Table 1).

A series of statistical tests was conducted to check whether the difference in error rates among the classifiers (1) was statistically significant, i.e., not a mere fluctuation due to a few extremely good or bad users, and (2) holds true for a larger population of users. The tests were conducted pairwise between the top four multi-class classifiers and the top two one-class classifiers for each dataset. The top-performing multi-class classifiers varied across datasets; however, the top-performing one-class classifiers were consistent across all four datasets. The user-level HTERs were the input to the test methods. The mixed effects Analysis of Variance (MANOVA) method is generally used to test the statistical significance of differences between two samples, in this case, the difference between user-level mean HTERs obtained by the classifiers that were being tested. The MANOVA test assumes that the pairwise difference of mean HTERs across the users follows a Gaussian distribution. To find out whether that assumption is true, we applied the Kolmogorov-Smirnov (KS) test [32]. The null hypothesis for the KS test was that the difference follows a Gaussian distribution. The KS test rejected the null hypothesis at the 5% significance level for all pairs of classifiers that we compared; the corresponding p-values, presented in Table 2, were very low. Hence, MANOVA was ruled out.

We then used the Wilcoxon signed-rank test, which makes no assumption about the underlying distribution of the error difference. The null hypothesis of this test was that the difference between the HTERs achieved by two algorithms was significant and will hold for a larger population. The tests failed to reject the null hypothesis at the 5% level. We concluded that the difference in error rates holds true for the current as well as a larger population of users. This conclusion was further verified using the Friedman test [33]. The p-values of the Wilcoxon and Friedman tests are also reported in Table 2.
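A paired comparison of user-level HTERs could be run with SciPy as sketched below. The HTER values are synthetic, and standardizing the differences before the KS test is an assumption of this sketch. As implemented in SciPy, `kstest` compares the sample against the named reference distribution, and `wilcoxon` tests whether the paired differences are symmetric about zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic user-level HTERs (percent) for two classifiers over 28 users.
hter_a = rng.uniform(5.0, 15.0, size=28)
hter_b = hter_a + rng.normal(0.0, 2.0, size=28)

# Normality check on the paired differences: standardize, then compare
# against the standard normal distribution with a KS test.
diff = hter_a - hter_b
standardized = (diff - diff.mean()) / diff.std(ddof=1)
ks_stat, ks_p = stats.kstest(standardized, "norm")

# Distribution-free paired test on the same user-level HTERs.
w_stat, w_p = stats.wilcoxon(hter_a, hter_b)
```

A small `ks_p` would argue against a Gaussian assumption for the differences, which is the situation in which a distribution-free paired test becomes the natural choice.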

5 Conclusion and Future Work

Our findings suggest that it is possible to build behavioral-biometrics-based continuous authentication systems without using samples from the impostor class. Such systems can be implemented using one-class classifiers and their fusion. SV1C and LOF achieved comparable error rates and outperformed half of the eight multi-class classifiers. The fusion of one-class classifiers could not improve the performance of the system significantly; however, we believe that a deeper investigation of fusion might reduce the error rates further. Hence, in the future, we aim to explore the fusion further by considering more one-class classifiers, and to test them on publicly available behavioral biometric datasets.

6 Acknowledgement

We thank the anonymous reviewers for their insightful feedback. This work was supported in part by National Science Foundation Award SaTC #1527795. During the submission of the paper for review, Partha Pratim Kundu was with Syracuse University.

