SynSig2Vec: Learning Representations from Synthetic Dynamic Signatures for Real-world Verification
Skilled forgery attacks remain an open research problem in automatic signature verification. However, skilled forgeries are very difficult to acquire for representation learning. To tackle this issue, this paper proposes to learn dynamic signature representations through ranking synthesized signatures. First, a neuromotor-inspired signature synthesis method is proposed to synthesize signatures with different distortion levels for any template signature. Then, given the templates, we construct a lightweight one-dimensional convolutional network to learn to rank the synthesized samples, and directly optimize the average precision of the ranking to exploit relative and fine-grained signature similarities. Finally, after training, fixed-length representations can be extracted from dynamic signatures of variable lengths for verification. One highlight of our method is that it requires neither skilled nor random forgeries for training, yet it surpasses the state-of-the-art by a large margin on two public benchmarks.
Handwritten signatures are the most socially and legally accepted means of personal authentication. They are generally regarded as a formal and legal means to verify a person’s identity in administrative, commercial and financial applications, for example, when signing credit card receipts. Over the last forty years, research interest in automatic signature verification (ASV) has grown steadily, and a number of comprehensive survey papers have summarized the state-of-the-art results in the field till 2018 [plamondon1989automatic, plamondon2000online, impedovo2008automatic, diaz2019perspective]. Nowadays, building an ASV system to separate genuine signatures and random forgeries (produced by a forger who has no knowledge about the authentic author’s name or signature) can be considered a solved task, while to separate genuine signatures and skilled forgeries (produced by a forger after unrestricted practice) still remains an open research problem.
Over the years, considerable research effort has been devoted to obtaining good signature representations by developing new features and feature selection techniques. With the rise of deep learning, several studies have also applied deep models to learn representations for dynamic signature sequences or static signature images, and have reduced the verification error against skilled forgeries. However, these methods have several limitations. First, they require skilled forgeries as training samples to achieve good performance. As a biometric trait and a special kind of private data, handwritten signatures are non-trivial to collect; skilled forgeries, which require the forgers to practise repeatedly, are even more difficult to acquire. Therefore, these methods can hardly perform equally well when only genuine signatures are available for training. Second, these methods generally lack an appropriate data augmentation method, which is fundamental in training deep learning models. The main reason lies in the following questions: Which type of data augmentation can essentially capture the variance of the underlying signing process? To what extent can an augmented genuine signature maintain its “genuineness”? Third, existing loss functions do not consider fine-grained signature similarities and tend to overfit. For example, in Siamese networks, positive and negative signature pairs are always labelled 1 and 0, respectively, regardless of their actual visual similarity.
In this paper, we focus on dynamic signature verification and propose a novel ASV system without the above-mentioned limitations. A basis of our method is the kinematic theory of rapid human movements and its Sigma Lognormal ($\Sigma\Lambda$) model [plamondon1995kinematic]. The model hypothesizes that the velocity of a neuromuscular system can be modelled by a vector summation of a number of lognormal functions, each of which is described by six parameters. Rooted in this model, we extract the underlying neuromuscular parameters of genuine signatures and synthesize new signatures by introducing perturbations to the parameters. The level of parameter perturbation controls the level of signature distortion; based on one genuine signature, we can synthesize signatures with various distortion levels, as shown in Fig. 1. Thereafter, many representation learning methodologies can be considered, such as metric learning. In this study, we propose to learn dynamic signature representations through optimizing the average precision (AP) of signature similarity ranking based on the direct loss minimization framework proposed by [song2016training]. This learning strategy has two benefits. First, as a list-wise ranking method, AP optimization can preserve and exploit fine-grained signature similarities in the ranking list. Second, it is expected to improve the performance since AP is closely related to verification accuracy. Signature similarities are computed as cosine similarities of representations extracted from a one-dimensional convolutional neural network (CNN).
The main contributions of this paper are three-fold. First, the application of the $\Sigma\Lambda$ model to signature synthesis not only eliminates the need for skilled forgeries, but also serves as a data augmentation technique. Second, we introduce AP optimization and demonstrate its effectiveness for dynamic signature representation learning. Third, we design a simple yet state-of-the-art CNN structure to extract fixed-length representations from dynamic signatures of variable lengths.
The application of deep learning to dynamic signature verification remains relatively unexplored because large datasets are difficult to collect. Existing studies in the field can be roughly divided into two categories. The first category learns local representations [lai2018recurrent, wu2019prewarping]. It maintains the temporal information of the input and applies dynamic time warping (DTW) to the learned feature sequence; to this end, specific techniques may be needed during training, such as a modified gated recurrent unit [lai2018recurrent] and signature prewarping [wu2019prewarping]. The second category learns fixed-length global representations [tolosana2018exploring, ahrabian2018usage, park2019robust]. For example, [park2019robust] used a CNN and time interval embedding to extract features from dynamic signature strokes, and then utilized recurrent neural networks to aggregate over the strokes. As mentioned in the introduction, the above methods must be trained with skilled forgeries to perform well when verifying this type of sample, which may be impractical in many situations. Our method requires only genuine signatures because of the introduced synthesis method, and learns global representations using a lightweight CNN.
Two studies closely related to ours also use $\Sigma\Lambda$-based synthetic signatures to train dynamic ASV systems. [diaz2016dynamic] synthesized auxiliary template signatures to enhance several non-deep learning systems, including DTW-based, HMM-based and Manhattan-based systems. Whether synthetic signatures are effective in deep learning was not validated in their study. [ahrabian2018usage] trained and tested on fully synthetic signatures using a recurrent autoencoder and a Siamese network. Whether synthetic data helps to verify real-world signatures was not investigated. Also, unlike these two studies, our method learns to rank signatures synthesized with different distortion levels, which is a novel idea in the field of ASV.
|Parameters||Admissible Ranges||Distortion Levels|
[bhattacharya2017sigma] did not work on ’s admissible ranges. We decide this range based on our visual tests.
$\Sigma\Lambda$-based Signature Synthesis
The kinematic theory of rapid human movements, from which the $\Sigma\Lambda$ model was developed, suggests that human handwriting consists in controlling the pen-tip velocity with overlapping lognormal impulse responses, called strokes, as illustrated in Fig. 2. The magnitude and direction of the velocity profile of the $i$-th stroke are described as:
$$|\vec{v}_i(t)| = \frac{D_i}{\sigma_i\sqrt{2\pi}\,(t-t_{0i})}\exp\left(-\frac{\left(\ln(t-t_{0i})-\mu_i\right)^2}{2\sigma_i^2}\right), \tag{1}$$
$$\phi_i(t) = \theta_{si} + \frac{\theta_{ei}-\theta_{si}}{2}\left[1+\operatorname{erf}\left(\frac{\ln(t-t_{0i})-\mu_i}{\sigma_i\sqrt{2}}\right)\right], \tag{2}$$
where $D_i$ is the amplitude of the stroke, $t_{0i}$ the time of occurrence, $\mu_i$ the log time delay, $\sigma_i$ the log response time, and $\theta_{si}$ and $\theta_{ei}$ respectively the starting angle and ending angle of the stroke. The velocity of the complete handwriting movement is considered as the vector summation of the individual stroke velocities:
$$\vec{v}(t) = \sum_{i=1}^{N}\vec{v}_i(t) = \sum_{i=1}^{N}|\vec{v}_i(t)|\begin{bmatrix}\cos\phi_i(t)\\ \sin\phi_i(t)\end{bmatrix}, \tag{3}$$
$N$ being the number of strokes. In short, each stroke is defined by six parameters: $P_i = [D_i, t_{0i}, \mu_i, \sigma_i, \theta_{si}, \theta_{ei}]$, and a complete handwriting component is defined by $P = [P_1, \ldots, P_N]$. (The parameters are extracted based on our implementation of [o2009development].) In this study, one complete “component” refers to the trajectory of a pen-down movement, and each component, i.e. pen-down, in the signature is analyzed individually. Although the entire signature can be viewed as a single component by considering the pen-up movements, this practice is less preferred for two reasons. First, it complicates the parameter extraction process. Second, many current devices and datasets do not record pen-ups.
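Based on the extracted parameters, the component can be reconstructed as follows:
$$x(t) = x_r(t) + \int_{0}^{t}\sum_{i=1}^{N}|\vec{v}_i(\tau)|\cos\phi_i(\tau)\,d\tau, \tag{4}$$
$$y(t) = y_r(t) + \int_{0}^{t}\sum_{i=1}^{N}|\vec{v}_i(\tau)|\sin\phi_i(\tau)\,d\tau, \tag{5}$$
where $|\vec{v}_i|$ and $\phi_i$ are the magnitude and direction of the $i$-th stroke velocity, and $x_r(t)$ and $y_r(t)$ are the residual trajectories that are considered (by the parameter extraction algorithm) to contain no valid strokes. By introducing perturbations to the extracted parameters, new synthetic components can be generated. A new signature can thus be generated by synthesizing every component in the template signature.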
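As a rough illustration of the reconstruction step, the sketch below evaluates the standard lognormal stroke velocity and integrates the summed velocities numerically. It is not the authors' implementation: the function names, the rectangle-rule integration, the stroke parameters and the omission of the residual trajectory are all assumptions made for the example.

```python
import math

def stroke_velocity(t, D, t0, mu, sigma, theta_s, theta_e):
    """Return (speed, direction) of one lognormal stroke at time t."""
    if t <= t0:
        return 0.0, theta_s  # the stroke has not started yet
    z = (math.log(t - t0) - mu) / sigma
    speed = D / (sigma * math.sqrt(2 * math.pi) * (t - t0)) * math.exp(-0.5 * z * z)
    direction = theta_s + 0.5 * (theta_e - theta_s) * (1 + math.erf(z / math.sqrt(2)))
    return speed, direction

def reconstruct(strokes, duration, fs=200.0):
    """Integrate the summed stroke velocities into an (x, y) trajectory.

    The residual trajectory is ignored in this sketch.
    """
    dt = 1.0 / fs
    x, y = 0.0, 0.0
    xs, ys = [x], [y]
    for k in range(1, int(duration * fs) + 1):
        t = k * dt
        vx = vy = 0.0
        for params in strokes:
            speed, phi = stroke_velocity(t, *params)
            vx += speed * math.cos(phi)
            vy += speed * math.sin(phi)
        x += vx * dt  # simple rectangle-rule integration
        y += vy * dt
        xs.append(x)
        ys.append(y)
    return xs, ys

# Hypothetical stroke: (D, t0, mu, sigma, theta_s, theta_e)
strokes = [(1.0, 0.0, -1.5, 0.3, 0.0, 0.0)]
xs, ys = reconstruct(strokes, duration=1.0)
```

With equal starting and ending angles the stroke is a straight line, and its total length approaches the amplitude of the stroke, which gives a quick sanity check on the velocity formula.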
, , and are distorted as follows:
while , are distorted as follows:
The six perturbation terms are uniform random variables that decide the signature distortion level, and are fixed for all strokes across a component. A previous study [bhattacharya2017sigma] carried out a visual Turing test on synthetic characters and worked out the admissible ranges (in percentage terms) of parameter variation: varying the parameters outside these ranges makes the character unrecognizable. We borrow these results for three of the perturbation terms; the remaining three are empirically restricted to a small range. On this basis, for each genuine signature, two groups of signatures, referred to below as the low-distortion group and the higher-distortion group, are generated with two different distortion levels as shown in Table I. Signatures in the low-distortion group should rank higher, according to their similarity to the template signature, than those in the higher-distortion group. In this context, one can regard the low-distortion group as augmented genuine signatures, and the higher-distortion group as synthetic skilled forgeries. These two distortion levels correspond to the low and median distortion levels illustrated in Fig. 1, and are determined through a simple coarse grid search, leaving room for future improvements.
To preserve and exploit fine-grained signature similarities, we construct one-dimensional CNNs to learn to rank these synthesized signatures, and optimize the AP of the ranking, as described in the following section.
Average Precision Optimization
Given one genuine signature and its synthetic samples, we compute and rank their similarities and incorporate the AP of the ranking into the loss function for optimization. Because AP is non-differentiable with respect to the CNN’s outputs, we resort to the General Loss Gradient Theorem proposed by [song2016training], which provides the weight update rule for the optimization of AP. Specifically, a neural network can be viewed as a composite scoring function $F(x, y, w)$, which depends on the input $x$, the output $y$, and some parameters $w$. The theorem (cited here for completeness) states that:
When given a finite set $\mathcal{Y}$, a scoring function $F(x, y, w)$, a data distribution, as well as a task-loss $L(y, \hat{y})$, then, under some mild regularity conditions, the direct loss gradient has the following form:
$$\nabla_w \mathbb{E}\left[L(y, y_w)\right] = \pm\lim_{\epsilon\to 0}\frac{1}{\epsilon}\,\mathbb{E}\left[\nabla_w F(x, y_{\mathrm{direct}}, w) - \nabla_w F(x, y_w, w)\right], \tag{12}$$
with
$$y_w = \arg\max_{\hat{y}\in\mathcal{Y}} F(x, \hat{y}, w), \tag{13}$$
$$y_{\mathrm{direct}} = \arg\max_{\hat{y}\in\mathcal{Y}} F(x, \hat{y}, w) \pm \epsilon L(y, \hat{y}). \tag{14}$$
In Eq. 12, two directions ($\pm$) of optimization can be used. The positive direction steps away from worse parameters $w$, while the negative direction moves toward better ones.
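In the context of this paper, $x$ refers to the concatenation of the two synthetic signature groups for one given genuine signature $g$, namely the low-distortion group $\mathcal{P}$ and the higher-distortion group $\mathcal{N}$. And $y$ is the collection of all pairwise comparisons $y_{ij}$, where $y_{ij} = 1$ if $x_i$ is ranked higher than $x_j$, $y_{ii} = 0$, and $y_{ij} = -1$ otherwise. We define the scoring function as follows:
$$F(x, y, w) = \frac{1}{|\mathcal{P}||\mathcal{N}|}\sum_{x_i\in\mathcal{P}}\sum_{x_j\in\mathcal{N}} y_{ij}\left(\varphi(x_i, w) - \varphi(x_j, w)\right), \tag{15}$$
where
$$\varphi(x_i, w) = \frac{f(x_i; w)\cdot f(g; w)}{\|f(x_i; w)\|\,\|f(g; w)\|}, \tag{16}$$
and $f(\cdot\,; w)$ is the embedding function parameterized by a CNN with learnable parameters $w$. The scoring function is inherited from [yue2007support] and measures the cosine similarity of representations from samples $x_i$ and $g$. We notice that similar definitions to Eqs. 15 and 16 have been used for few-shot learning [triantafillou2017few].
Further, let $p = \mathrm{rank}(y) \in \{0, 1\}^{|\mathcal{P}|+|\mathcal{N}|}$ be a vector constructed by sorting the data points according to the ranking defined by $y$, such that $p_j = 1$ if the $j$-th data point belongs to $\mathcal{P}$ and $p_j = 0$ otherwise. Then, given ground truth and predicted configurations $y$ and $\hat{y}$, the AP loss is
$$L_{AP}(y, \hat{y}) = 1 - \frac{1}{|\mathcal{P}|}\sum_{j:\,\hat{p}_j = 1}\mathrm{Prec}@j,$$
where $\hat{p} = \mathrm{rank}(\hat{y})$ and $\mathrm{Prec}@j$ is the fraction of the top $j$ ranked samples that belong to $\mathcal{P}$.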
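The pairwise scoring function can be sketched as follows. The exact notation of the original equations is not reproduced here; the sketch assumes the usual form inherited from [yue2007support], i.e. the mean of `y[i][j] * (phi_i - phi_j)` over all pairs of one low-distortion and one higher-distortion sample, with `phi` the cosine similarity to the template embedding. All function names and embeddings below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_score(template, low, high, y):
    """Mean of y[i][j] * (phi_i - phi_j) over all (low i, high j) pairs.

    y[i][j] is +1 if low-distortion sample i is ranked above
    higher-distortion sample j, and -1 otherwise.
    """
    phi_low = [cosine(e, template) for e in low]
    phi_high = [cosine(e, template) for e in high]
    total = 0.0
    for i, p_i in enumerate(phi_low):
        for j, p_j in enumerate(phi_high):
            total += y[i][j] * (p_i - p_j)
    return total / (len(low) * len(high))

# Hypothetical 2-D embeddings, for illustration only.
template = [1.0, 0.0]
low = [[1.0, 0.1], [1.0, 0.2]]    # close to the template
high = [[1.0, 1.0], [0.0, 1.0]]   # further from the template
y_true = [[1, 1], [1, 1]]         # every low sample ranks above every high one
y_flipped = [[-1, -1], [-1, -1]]
score_true = ranking_score(template, low, high, y_true)
score_flipped = ranking_score(template, low, high, y_flipped)
```

When the embeddings agree with the intended ranking, the ground-truth configuration receives a positive score and the reversed configuration receives the negated score, so maximizing the score over configurations recovers the ranking by similarity.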
The loss-augmented prediction required by the theorem can be inferred via a dynamic programming algorithm [song2016training]. To prevent overconfidence of the scoring function, we add a regularization term to the AP loss, and obtain the following gradient
Besides the AP loss, we also employ the standard cross-entropy loss for signature classification. Intuitively, these two losses protect the ASV system from skilled forgery and random forgery attacks, respectively. The overall optimization process is given in Algorithm 1.
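For intuition, the AP loss of a predicted ranking can be computed as one minus the average precision. The sketch below uses the standard definition of average precision and assumes at least one positive sample; it covers only the task loss, not the direct-loss gradient, the dynamic programming inference, or the regularization term. The function names are illustrative.

```python
def average_precision(scores, is_positive):
    """Average precision of the ranking induced by descending scores."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    hits = 0
    precisions = []
    for rank, k in enumerate(order, start=1):
        if is_positive[k]:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / len(precisions)

def ap_loss(scores, is_positive):
    """AP loss: zero for a perfect ranking, larger for worse rankings."""
    return 1.0 - average_precision(scores, is_positive)

# Similarities to the template; True marks a low-distortion sample.
perfect = ap_loss([0.9, 0.8, 0.7, 0.6], [True, True, False, False])
mixed = ap_loss([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```

In the second call one higher-distortion sample is ranked above a low-distortion one, so the loss becomes 1 - (1 + 2/3)/2 = 1/6. Because the loss depends on the whole ranked list, it distinguishes between rankings that a pairwise 0/1 label would treat alike.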
Framework of SynSig2Vec ASV Method
In this part, we describe our overall framework, including signature preprocessing, CNN structure, and how we construct the verifier based on CNN representations.
Signature synthesis and preprocessing. For signature synthesis, each signature component is resampled at 200 Hz, which is the suggested sampling rate for parameter extraction [o2009development]. Then, a Butterworth lowpass filter with a cutoff frequency of 10 Hz is applied to the resampled trajectory to enhance the signal. The filtered component is then used for parameter extraction; based on extracted parameters, a synthetic component is generated and resampled at 100 Hz, which is the sampling rate of most existing dynamic signature datasets. As for real handwritten signatures used in this study (all collected at 100 Hz), they are also filtered with the Butterworth lowpass filter to be consistent with synthetic ones.
As we have omitted the pen-up components, we use a straight line to connect the end of a pen-down component and beginning of the next one. These lines can be viewed as virtual pen-ups and have a constant speed equaling the average speed of the pen-downs. Because the essential difference between genuine signatures and the synthetic ones lies in their velocity profiles, we extract three feature sequences as follows:
where $x$ and $y$ are the coordinate sequences. These three feature sequences are normalized to have zero mean and unit variance, and then used as inputs to the CNN.
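The virtual pen-ups described above can be sketched as a straight-line interpolation between consecutive pen-down components. The code below is an illustrative assumption rather than the authors' procedure: components are lists of `(x, y)` points sampled at a fixed rate, and the number of interpolated points is rounded so that the step length only approximates the average pen-down step.

```python
import math

def path_length(points):
    """Total length of a polyline given as a list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def connect_components(components):
    """Join pen-down components with straight virtual pen-ups whose speed
    approximates the average pen-down speed (distance per sample)."""
    total_length = sum(path_length(c) for c in components)
    total_steps = sum(len(c) - 1 for c in components)
    step = total_length / total_steps  # average distance covered per sample
    signature = list(components[0])
    for comp in components[1:]:
        (x0, y0), (x1, y1) = signature[-1], comp[0]
        n = max(1, round(math.dist((x0, y0), (x1, y1)) / step))
        for k in range(1, n):  # interior points of the virtual pen-up
            a = k / n
            signature.append((x0 + a * (x1 - x0), y0 + a * (y1 - y0)))
        signature.extend(comp)
    return signature

# Two hypothetical pen-down components moving one unit per sample.
components = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [(5.0, 0.0), (6.0, 0.0)]]
signature = connect_components(components)
```

Here the gap of three units between the components is filled with two interpolated points, so the joined sequence keeps a uniform step of one unit per sample.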
Network structure. A one-dimensional CNN with six convolution layers and scaled exponential linear units (SELUs) [klambauer2017self], as shown in Table 2, is employed to learn fixed-length representations from signature sequences. Batch normalization is not applied, because during training each batch consists of only one genuine signature and its synthesized samples, which are non-i.i.d. samples. Nevertheless, SELU provides an alternative normalization effect, and is found to work surprisingly well in our study. Because the signature length may vary after synthesis, we pad all signatures inside a batch with zeros to the maximal length. A corresponding mask is generated to perform masked average pooling over the feature sequences coming out of the sixth convolutional layer.
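Masked average pooling can be illustrated with plain lists. In this sketch, `features[b][t]` is the channel vector of sample `b` at time step `t`, zero-padded to a common length, and `lengths[b]` is the number of valid time steps; both names are assumptions made for the example.

```python
def masked_average_pool(features, lengths):
    """Average each channel over the valid (unpadded) time steps only."""
    pooled = []
    for seq, n in zip(features, lengths):
        channels = len(seq[0])
        pooled.append([sum(seq[t][c] for t in range(n)) / n for c in range(channels)])
    return pooled

# Two samples with 2 channels; the first has one padded time step.
features = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]],
]
lengths = [2, 3]
pooled = masked_average_pool(features, lengths)
```

A plain mean over all three time steps would give `[4/3, 2.0]` for the first sample, so the mask keeps the zero padding from diluting the representation of shorter signatures. In the actual network the valid lengths would have to be tracked through the three stride-2 pooling layers.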
The receptive field of the sixth convolutional layer is 54 samples. For dynamic signatures sampled at 100 Hz, such a receptive field covers a time interval of about 0.54 s and captures several lognormal strokes. We have experimented with deeper networks and larger receptive fields, but found no significant further improvement. We have also explored residual connections, but found that they degraded the performance.
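The receptive field quoted above can be checked from the layer configuration in Table 2 with the usual recurrence: each layer adds `(kernel - 1) * jump` samples, and the jump is multiplied by the layer's stride. The helper name is illustrative.

```python
# (kernel size, stride) of layers 1-9 in Table 2, in order.
LAYERS = [(7, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1)]

def receptive_field(layers):
    """Receptive field (in input samples) of the last layer in the stack."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

rf = receptive_field(LAYERS)
seconds = rf / 100.0  # duration covered at a 100 Hz sampling rate
print(rf, seconds)
```

The intermediate values are 7, 8, 12, 16, 18, 26, 34, 38 and 54, so the last convolution sees 54 input samples, i.e. 0.54 s at 100 Hz.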
A 256-dimensional feature vector is obtained from the masked average pooling layer, and then goes into two branches. The first branch is a fully connected layer with softmax activation, and the cross-entropy loss is computed on top of it. The second branch is a fully connected layer with 512 neurons, on top of which the AP loss is computed. The AP loss and the cross-entropy loss are used together to optimize the network parameters as described in Algorithm 1. After training, only the second branch is kept. Then, for any given dynamic signature, a 512-dimensional feature vector can be extracted as its representation.
Table 2: Structure of the one-dimensional CNN (k: kernel size, s: stride, p: padding).
|Layer||Configuration|
|1||1D Convolution, 64, k7, s1, p3|
|2||1D Max Pooling, k2, s2|
|3||1D Convolution, 64, k3, s1, p1|
|4||1D Convolution, 128, k3, s1, p1|
|5||1D Max Pooling, k2, s2|
|6||1D Convolution, 128, k3, s1, p1|
|7||1D Convolution, 256, k3, s1, p1|
|8||1D Max Pooling, k2, s2|
|9||1D Convolution, 256, k3, s1, p1|
|10||Masked 1D Average Pooling|
|11||FC with softmax||FC, 512|
Verifier. We use a distance-based verifier with the same normalization technique as in [lai2018recurrent]. Specifically, given two signatures $s_i$ and $s_j$, we compute the Euclidean distance of their normalized vectors:
$$d(s_i, s_j) = \left\|\frac{f(s_i)}{\|f(s_i)\|_2} - \frac{f(s_j)}{\|f(s_j)\|_2}\right\|_2,$$
where $f(\cdot)$ is the 512-dimensional feature vector extracted from the CNN. Given the template signatures of a client, we compute their average pair-wise distance (with a default value when only a single template is available). Then, for a test signature claimed to belong to that client, we compute the following scores:
From these scores, the average score and minimum score are computed, leading to a 2-D scatter plot on which we fit a line and make the decision. In practice, we find simple already works well. By varying the threshold , we can obtain the equal error rates (EERs) to assess the system performance. Unless mentioned otherwise, a global threshold for all individuals is used.
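The distance and the threshold sweep behind the EER can be sketched as follows. This is a simplified illustration under stated assumptions: scores are distance-like (lower means more genuine), a single global threshold is swept over the observed scores, and the EER is approximated by the point where the false rejection and false acceptance rates are closest. The template-based score normalization of the verifier is not reproduced, and the function names are hypothetical.

```python
import math

def normalized_distance(u, v):
    """Euclidean distance between the L2-normalized versions of u and v."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.sqrt(sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v)))

def equal_error_rate(genuine_scores, forgery_scores):
    """Approximate EER by sweeping a global threshold over all scores."""
    best_gap, best_eer = None, None
    for c in sorted(set(genuine_scores) | set(forgery_scores)):
        frr = sum(s > c for s in genuine_scores) / len(genuine_scores)
        far = sum(s <= c for s in forgery_scores) / len(forgery_scores)
        if best_gap is None or abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer

# Hypothetical scores, for illustration only.
separable = equal_error_rate([0.1, 0.2, 0.3], [0.5, 0.6, 0.7])
overlapping = equal_error_rate([0.1, 0.2, 0.6, 0.3], [0.5, 0.7, 0.8, 0.9])
```

Because both vectors are normalized to unit length, the distance is a monotonic function of the cosine similarity used during training, which keeps the verifier consistent with the learned ranking.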
Datasets and protocols
Two benchmark dynamic signature datasets were used in this study, namely MCYT-100 [ortega2003mcyt] and SVC-Task2 [yeung2004svc2004]. The MCYT-100 dataset consists of 100 individuals with 25 genuine signatures and 25 skilled forgeries for each individual. The SVC-Task2 dataset contains 40 individuals with 20 genuine and 20 forged signatures per individual.
For MCYT-100, we used 10-fold cross-validation. The $i$-th fold corresponded to the $i$-th group of ten individuals, and we trained the models in a round-robin fashion on all but one of the folds. Skilled forgeries were not included in the training set. In the testing stage, we considered two scenarios, namely T5 and T1. In scenario T5, five genuine signatures were randomly selected as templates for each individual in the test fold, while the remaining 20 genuine and 25 forged signatures were used for testing; EERs were computed and averaged over 50 trials. In scenario T1, each genuine signature was considered as a single template to test against the remaining signatures. Finally, for both scenarios, EERs were averaged again over the 10 test folds.
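The fold construction can be sketched as below, assuming individuals are indexed from 0 and that their number is divisible by the number of folds; the helper names are illustrative.

```python
def make_folds(num_individuals, num_folds=10):
    """Fold i holds the i-th block of consecutive individuals."""
    size = num_individuals // num_folds
    return [list(range(i * size, (i + 1) * size)) for i in range(num_folds)]

def split(folds, test_index):
    """Return (training individuals, test individuals) for one round."""
    train = [u for i, fold in enumerate(folds) if i != test_index for u in fold]
    return train, folds[test_index]

mcyt_train, mcyt_test = split(make_folds(100), test_index=2)
svc_train, svc_test = split(make_folds(40), test_index=0)
print(len(mcyt_train), len(mcyt_test), len(svc_train), len(svc_test))
```

Each round therefore trains on 90 individuals and tests on 10 for MCYT-100, and trains on 36 and tests on 4 for a 40-individual dataset such as SVC-Task2.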
For SVC-Task2, similarly we used a 10-fold cross validation and considered scenarios T5 and T1. The training set only included real signatures, therefore both the network and learning algorithm should be data-efficient.
For signature synthesis, to accelerate training, synthetic signatures were first generated offline to create two data pools and for each genuine signature. Then, during training, and were drawn from and respectively. We set , and . Therefore, the batch size was .
For AP optimization, we chose the positive direction and was set as 5 for MCYT-100 and 10 for SVC-Task2. A larger value of for SVC-Task2 led to better generalization because of the small dataset scale. The models were optimized using stochastic gradient descent. The learning rate, momentum and weight decay were set as 0.001, 0.9 and 0.001, respectively. We trained for batches, where was the number of classes in the training set (90 for MCYT-100 and 36 for SVC-Task2). For the proposed method, the final models were evaluated to report the EERs. For the methods to be compared, the best models were evaluated to see their capacities.
First, based on synthetic signatures, we compared AP loss with triplet loss and binary cross-entropy (BCE) to examine its properties. For triplet loss, the distance metric in Eq. 23 was used. Eight hardest triplets were mined from a total of triplets, and the margin was set as 0.25; pair-wise distances within were also added to the loss to minimize the intra-class variance. For BCE, cosine similarity in Eq. 16 was computed, rescaled, and activated by the sigmoid function. Within a batch, there were positive and negative pairs, which were labeled as (0.9, 0.1) and (0.5, 0.5) respectively. The loss of positive pairs was doubled for balance.
The EER curves (using a global threshold), as functions of the number of trained batches, are shown in Fig. 3. First, the AP loss consistently outperformed the other two losses. Second, the AP loss was more robust against overfitting, and continued to decrease the EERs in late training iterations. Third, somewhat surprisingly, BCE and the triplet loss behaved differently on the two datasets. A possible reason is that the triplet loss involves hard sample mining and the choice of a proper margin, both of which must be tuned carefully for each dataset. Detailed EERs are given in Table 3. On the SVC-Task2 dataset, the previous best EERs are 7.80% in scenario T5 and 18.25% in scenario T1, and our method reduces them by 40.4% and 34.5% in relative terms, respectively. On the MCYT-100 dataset, our method achieves relative EER reductions of 5.6% and 59.4% in scenarios T5 and T1, respectively. This large improvement demonstrates that our model can extract intrinsic and robust representations from real-world signatures by learning from synthesized ones.
We further compared synthetic signatures with real handwritten ones. Specifically, we compared the following cases:
(1) Signatures in the low-distortion and higher-distortion groups were replaced with genuine signatures and skilled forgeries, respectively;
(2) Signatures in the higher-distortion group were replaced with skilled forgeries;
(3) Signatures in the low-distortion group were replaced with genuine signatures;
(4) Signatures in both groups were synthetic.
All models were trained in exactly the same way using the AP loss, and the EER curves are shown in Fig. 4. There were two important observations. First, we can see that synthesized signatures were even more effective than real handwritten signatures for training, because they were constructed from disturbed parameters and therefore tightly bounded the template signatures. Second, case 3 converged the fastest, but led to slight overfitting on SVC-Task2 and severe overfitting on MCYT-100. Therefore, when using case 3, a representative validation set is necessary; when using case 4, we can simply train for a large number of batches and use the final models, which are generally also the best-performing ones. Detailed EERs are given in Table 4.
Comparison with state-of-the-art
|MCYT-100||SRSS based on $\Sigma\Lambda$ model [diaz2016dynamic]||1||13.56||-|
|MCYT-100||Symbolic representation [guru2017interval]||5||5.70||2.20|
|MCYT-100||DTW warping path score [sharma2017exploration]||5||2.76||1.15|
|MCYT-100||DTW with SCC [xia2017signature]||5||-||2.15|
|MCYT-100||Recurrent adaptation networks [lai2018recurrent]||5||1.81||-|
|SVC-Task2||SRSS based on $\Sigma\Lambda$ model [diaz2016dynamic]||1||18.25||-|
|SVC-Task2||DTW warping path score [sharma2017exploration]||5||7.80||2.53|
|SVC-Task2||DTW with SCC [xia2017signature]||5||-||2.63|
In Table 5, we compare our method with state-of-the-art methods on the MCYT-100 and the SVC-Task2 datasets. Our method has achieved substantial improvements over the previous methods, especially in scenario T1 where only one template is available. In scenario T1, our method reduces the EERs by 59.4% (=(13.56-5.50)/13.56*100%) and 34.5% (=(18.25-11.96)/18.25*100%) on the MCYT-100 and the SVC-Task2 datasets, respectively.
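The reductions quoted for scenario T1 can be reproduced with a few lines of arithmetic. The sketch separates the relative reduction (in percent of the previous EER) from the absolute reduction (in percentage points); the EER values are those given in the text, and the function name is illustrative.

```python
def relative_reduction(previous, ours):
    """Relative EER reduction in percent of the previous EER."""
    return (previous - ours) / previous * 100.0

results = {}
for name, previous, ours in [("MCYT-100", 13.56, 5.50), ("SVC-Task2", 18.25, 11.96)]:
    absolute = previous - ours  # reduction in percentage points
    results[name] = (round(relative_reduction(previous, ours), 1), round(absolute, 2))
    print(name, results[name])
```

The relative reductions are 59.4% and 34.5%, while the corresponding absolute reductions are 8.06 and 6.29 percentage points.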
Limitations and future work
Some issues in SynSig2Vec need further study, such as the effects of signature distortion levels and the number of synthetic signatures. Besides, SynSig2Vec only uses the pen-down components and is expected to have further improvements in future work by considering the pen-ups if available.
In this paper, we propose to learn dynamic signature representations through ranking synthesized signatures. The $\Sigma\Lambda$ model is introduced to synthesize two groups of signatures for each given genuine signature. Signatures in the first group have lower distortion levels and should rank higher, according to their similarity to the template signature, than those in the second group. We construct a lightweight one-dimensional CNN to learn to rank these synthesized samples, and incorporate the AP of the ranking into the loss function for optimization. Our method requires only genuine signatures for training, yet substantially improves the state-of-the-art performance on two public benchmarks. In particular, when only one template signature is available for the verifier, our method lowers the state-of-the-art EER on the MCYT-100 benchmark by 8.06 percentage points (from 13.56% to 5.50%), and on the SVC-Task2 benchmark by 6.29 percentage points (from 18.25% to 11.96%), showing its effectiveness and potential.
This research is supported in part by NSFC (Grant No.: 61936003), the National Key Research and Development Program of China (No. 2016YFB1001405), GD-NSF (no.2017A030312006), Guangdong Intellectual Property Office Project (2018-10-1), and GZSTP (no. 201704020134).