Domain Adaptation for Ear Recognition Using Deep Convolutional Neural Networks

Fevziye Irem Eyiokur*, Dogucan Yaman*, Hazım Kemal Ekenel
Department of Computer Engineering, Istanbul Technical University
Email: {eyiokur16, yamand16, ekenel}@itu.edu.tr
Abstract

In this paper, we have extensively investigated the unconstrained ear recognition problem. We have first shown the importance of domain adaptation when deep convolutional neural network models are used for ear recognition. To enable domain adaptation, we have collected a new ear dataset using the Multi-PIE face dataset, which we have named the Multi-PIE ear dataset. To improve the performance further, we have combined different deep convolutional neural network models. We have analyzed in depth the effect of ear image quality, for example illumination and aspect ratio, on the classification performance. Finally, we have addressed the problem of dataset bias in the ear recognition field. Experiments on the UERC dataset have shown that domain adaptation leads to a significant performance improvement. For example, when the VGG-16 model is used and domain adaptation is applied, an absolute increase of around 10% has been achieved. Combining different deep convolutional neural network models has further improved the accuracy by 4%. It has also been observed that image quality has an influence on the results. In the experiments that we have conducted to examine the dataset bias, given an ear image, we were able to classify the dataset that it comes from with 99.71% accuracy, which indicates a strong bias among the ear recognition datasets.

Keywords: Ear recognition, deep learning, domain adaptation
*The authors have contributed equally.
This paper is a postprint of a paper submitted to and accepted for publication in IET Biometrics and is subject to Institution of Engineering and Technology Copyright. The copy of record is available at the IET Digital Library.

1 Introduction

Human identification through biometrics has been both an important and popular research field. Among the biometric traits, the ear is a distinctive part of the human body in terms of features such as shape, appearance, and positioning, and its structure changes little over time, apart from a gradual elongation of the ear with age [1]. Various studies have been conducted and many different approaches have been proposed for ear recognition; however, it still remains an open challenge, especially when the ear images are collected under uncontrolled conditions as in the Unconstrained Ear Recognition Challenge (UERC) [2].

Ear recognition approaches are mainly categorized into four groups: holistic, local, geometric, and hybrid processing [1]. In earlier studies, the most popular feature extraction methods for ear recognition were SIFT [11], SURF [12], and LBP [13]. Due to the popularity of deep learning in recent years and its significant impact on the computer vision field [4, 5, 6, 20, 23], deep convolutional neural network (CNN) based approaches have also been adopted for ear recognition [2, 3, 23]. CNNs require a large amount of data for training. However, the number of samples in the datasets available for ear recognition is rather limited [1, 2, 7, 8, 9, 10]. Due to this limitation, CNN-based ear recognition approaches mainly utilize an already trained object classification model, a so-called pretrained deep CNN model, from one of the well-known, high-performing CNN architectures, for example [4, 5, 6]. These pretrained models were trained on the ImageNet dataset [19] for generic object classification purposes; therefore, they need to be adapted to the ear recognition problem. This adaptation is done mainly with a fine-tuning process, where the output classes are replaced with subject identities and the employed pretrained deep CNN model is further trained using the training part of an ear dataset.

In the field of ear recognition, most of the datasets used have been collected under controlled conditions, and therefore very high recognition performance has been achieved on them [1]. However, how well these accuracies transfer to real-world conditions is debatable. For this reason, in-the-wild datasets have been collected in order to better imitate the real-world challenges confronted in ear recognition [1]. Since these datasets contain images collected from the web, they show a large variety, for example, in terms of resolution, illumination, and use of accessories. Sample ear images shown in Fig. 1 are from the UERC dataset. It can be seen from Fig. 1 that there are accessories, partial occlusions due to hair, and also pose and illumination variations. Because of these significant appearance variations, the performance of ear recognition systems on in-the-wild datasets, such as the UERC, is not as high as that obtained on datasets collected under controlled conditions.

Fig. 1: Sample ear images from the UERC dataset [2]. The dataset contains many appearance variations in terms of ear direction (left or right), accessories, view angle, image resolution, and illumination.

In this paper, we present a comprehensive study on ear recognition in the wild. We have employed well-known, high-performing deep CNN models, namely AlexNet [4], VGG-16 [5], and GoogLeNet [6], and proposed a domain adaptation strategy for deep CNN-based ear recognition. We have also provided an in-depth analysis of several aspects of ear recognition. Our contributions are summarized as follows:

  • We have proposed a two-stage fine-tuning strategy for domain adaptation.

  • We have prepared an ear image dataset from the Multi-PIE face dataset, which we have named the Multi-PIE ear dataset. As can be seen in Table I, this dataset contains a larger number of ear images than the other ear datasets.

  • We have analyzed the effect of data augmentation and alignment on the ear recognition performance.

  • We have performed deep CNN model combination to improve accuracy.

  • We have examined varying aspect ratios of ear images and the illumination conditions they contain, and assessed their influence on the performance.

  • We have investigated the dataset bias problem for ear recognition.

For the experiments, we have used the Multi-PIE ear and the UERC datasets [2]. Since the Multi-PIE ear dataset was collected under controlled conditions, the results achieved on it are very high. From the experiments on the UERC dataset, we have shown that the proposed two-stage fine-tuning scheme is very beneficial for ear recognition. With data augmentation and without alignment, for AlexNet [4], the correct classification rate is increased from 52% to 56.46%. For VGG-16 [5] and GoogLeNet [6], the increase is from 54.2% to 63.62% and from 55.02% to 60.91%, respectively. Combining different deep convolutional neural network models has led to a further improvement of 4% over the single best performing model. We have observed that data augmentation enhances the accuracy, whereas alignment did not improve the performance. However, this point requires further investigation, since only a coarse alignment has been performed by flipping the ear images to one side. The experimental results show that the ear recognition system performs better when the ear images are cropped from profile faces. Very dark and very bright illumination causes missing details and reflections, which results in performance deterioration. Experiments to examine the dataset bias have indicated a strong bias among the ear recognition datasets.

The remainder of the paper is organised as follows. A brief review of the related work on ear recognition is given in Section 2. The employed methods in this work are explained in Section 3. In Section 4, experimental results are presented and discussed. Finally, Section 5 provides conclusions and future research directions.

2 Related Work

Many studies have been conducted in the field of ear recognition. In the following paragraphs, we give a brief overview. A comprehensive analysis of the existing studies in the area of ear recognition has been presented in [1]. Please refer to this paper for an extensive survey.

In [1], an in-the-wild ear recognition dataset, AWE, and a MATLAB ear recognition toolbox are introduced. The AWE dataset has become a valuable resource for the ear recognition field, which had previously relied on ear datasets collected under controlled conditions. The presented toolbox enables feature extraction from images with traditional, hand-crafted feature extraction methods. It also provides different distance metrics and tools for classification and performance assessment.

Recently, the Unconstrained Ear Recognition Challenge (UERC) was organized as a competition [2], and the UERC dataset was introduced for it. For the benchmark, training and testing sets from this dataset are specified. In the competition, mainly hand-crafted feature extraction methods, such as LBP [13] and POEM [30], and CNN-based feature extraction methods were used. One of the proposed methods eliminates earrings, hair, other obstacles, and background from the ear image with a binary ear mask, and performs recognition using hand-crafted features. In another proposed approach, the score matrices calculated from CNN-based features and hand-crafted features are fused. The remaining approaches participating in the competition employ only CNN-based features.

In [21], a new feature extraction method named Local Similarity Binary Pattern (LSBP) is introduced. This new method, which is used in conjunction with the Local Binary Pattern (LBP) features, is found to have superior ear recognition performance [21]. The proposed feature extraction method provides information both about connectivity and similarity.

In a recent study [23], a brief review of deep learning-based ear recognition approaches is given. On ear datasets collected under controlled conditions, deep learning-based approaches provide satisfactory results. However, it has been emphasized that detecting the ear in an image is a difficult task.

Another study that employed deep CNN models is presented in [3]. In this work, the AlexNet [4], VGG-16 [5], and SqueezeNet [20] architectures are used. Two different training approaches are applied: training the whole model from scratch, called full model learning, and training only the last layers of a pretrained deep CNN model, called selective model learning. The best results are obtained with SqueezeNet. Data augmentation has been applied to increase the amount of data for deep CNN model training. Selective model learning, using pretrained models that were trained on the ImageNet dataset, was found to perform better than full model learning in terms of ear recognition accuracy.

3 Methodology

In this section, we present the employed deep convolutional neural network models, data augmentation and transfer learning approaches, and provide information about the datasets, data alignment, and fusion techniques.

3.1 Convolutional Neural Networks

Fig. 2: Selected view angles from the Multi-PIE face dataset [14, 15]
Fig. 3: Illustration of ear detection and cropping on the Multi-PIE face dataset [14, 15]: (a) Input image, (b) Ear detected image, (c) Cropped ear image

In our study, we have employed convolutional neural networks for ear image representation and classification. A CNN consists of several layers that perform convolution, feature representation, and classification. The convolutional part of a CNN includes layers that perform operations such as convolution, pooling, and batch normalization [16]; these layers are placed sequentially to learn discriminative features from the image. The later layers then utilize these features for classification. In this work, we have used the softmax loss in the final layer of the employed deep CNN models.

The first deep convolutional neural network architecture used in this study is AlexNet [4], the winning model of the ILSVRC 2012 challenge [17]. AlexNet [4] has five convolutional layers and three fully connected layers. The dropout method [18] is used to prevent overfitting. In addition, we have utilized the VGG [5] and GoogLeNet [6] architectures. GoogLeNet [6] has 22 layers but about twelve times fewer parameters than AlexNet [4], and it is based on a new building block named inception. In an inception module, the input is filtered by several filters of different sizes in parallel, and the outputs of all filters are combined, which is very beneficial for extracting multiple kinds of features from the same input data. The VGG architecture has two versions: one contains 16 layers and is named VGG-16, whereas the other has 19 layers and is named VGG-19. VGG-16 has two fully connected layers followed by a softmax classifier after the convolutional layers, as in AlexNet [4]. VGG-16 [5] is a deeper network than AlexNet [4] and uses a large number of small filters, i.e. 3×3.
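As a minimal illustration of the parallel filtering idea behind the inception module, the following sketch concatenates the outputs of 1×1, 3×3, and 5×5 convolutions and a pooling branch; the channel counts are arbitrary and do not correspond to the actual GoogLeNet configuration.

```python
import torch
import torch.nn as nn

class SimpleInception(nn.Module):
    """Simplified inception-style block: the same input is filtered by
    1x1, 3x3, and 5x5 convolutions plus a pooling branch in parallel,
    and the resulting feature maps are concatenated along the channel axis."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1),
        )

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# e.g. a single 224x224 RGB ear image
out = SimpleInception(3)(torch.randn(1, 3, 224, 224))   # -> shape (1, 64, 224, 224)
```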

3.2 Transfer Learning, Domain Adaptation and Alignment

Transfer learning is applied in convolutional neural networks mainly in two different ways, depending on the size of the target dataset and its similarity to the pretraining dataset. The first common approach is to utilize a pretrained deep CNN model directly as a feature extractor for the input images. The extracted features are then fed into a classifier, for example a support vector machine, to learn to discriminate the classes from each other. This scheme is employed when the target dataset contains only a small number of samples. The second approach is fine-tuning the pretrained deep CNN model on the target dataset, that is, initializing the network weights with the pretrained model and further training the weights on the target dataset. This method is useful when the target dataset has a sufficient number of training samples, since fine-tuning on a target dataset with few training samples can lead to overfitting [25]. Depending on the task similarity between the two datasets and the amount of available training samples in the target dataset, one can decide between these two approaches [26].
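A minimal sketch of the first approach (a pretrained CNN used as a fixed feature extractor feeding an SVM) is shown below for illustration, using torchvision and scikit-learn; this is not the pipeline used in this work, which relies on fine-tuning, and the training data names are placeholders.

```python
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Use an ImageNet-pretrained VGG-16 as a fixed feature extractor:
# drop the final classification layer and keep the 4096-d penultimate output.
vgg = models.vgg16(pretrained=True).eval()
penultimate = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

def extract_features(images):
    """images: tensor of shape (N, 3, 224, 224), preprocessed as for ImageNet."""
    with torch.no_grad():
        conv = vgg.features(images)
        flat = torch.flatten(vgg.avgpool(conv), 1)
        return penultimate(flat).numpy()

# train_images / train_labels are assumed to be prepared elsewhere:
# svm = LinearSVC().fit(extract_features(train_images), train_labels)
```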

In our work, we have taken the pretrained models of the AlexNet [4], VGG-16 [5], and GoogLeNet [6] architectures, which were trained on the ImageNet dataset [19], and fine-tuned them on the ear datasets. The ear recognition datasets contain a limited number of training samples; for example, the ones used in this study contain around a thousand to ten thousand ear images. This amount of training data is sufficient for fine-tuning, although it would not be enough to train a deep CNN model from scratch. In our previous work on age and gender classification [24], we have shown that transferring a pretrained deep CNN model can provide better classification performance than training a task-specific CNN model from scratch when only a limited amount of data is available for the task at hand, as is the case for ear recognition. We have further shown that transferring a CNN model from a closer domain, that is, for age and gender classification, transferring a pretrained model that was trained on face images instead of one trained on generic object images, provides better performance. Utilizing this insight, we have performed a two-stage fine-tuning of the pretrained deep CNN models for ear recognition. For this approach, we have first constructed an ear dataset from the Multi-PIE face dataset [14, 15]. Then, we have fine-tuned the pretrained deep CNN models on this dataset. This way, we first provide a domain adaptation for the pretrained deep CNN models. In the second stage, we perform the final fine-tuning operation using the target dataset, which is the UERC dataset [2] in this work. This final fine-tuning stage provides a more specific domain and/or task adaptation; in our case, it is the adaptation required for the wild, uncontrolled conditions. This step is indeed very important, since, as we show in the experiments, there exists a dataset bias [27] among the ear recognition datasets.
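The two-stage fine-tuning strategy can be summarized by the following condensed sketch, shown with torchvision's VGG-16 for illustration; the actual training framework, data loaders, and hyper-parameters are described in the text and are not reproduced exactly here, and the train() calls are placeholders.

```python
import torch.nn as nn
import torchvision.models as models

def replace_head(model, num_classes):
    """Swap the final classification layer so that its outputs correspond
    to the subject identities of the current ear dataset."""
    in_features = model.classifier[-1].in_features
    model.classifier[-1] = nn.Linear(in_features, num_classes)
    return model

# Stage 0: generic object recognition weights (ImageNet).
model = models.vgg16(pretrained=True)

# Stage 1: domain adaptation -- fine-tune on the Multi-PIE ear dataset
# (205 subjects in the dataset constructed in this work).
model = replace_head(model, 205)
# train(model, multipie_ear_loader)   # standard supervised fine-tuning

# Stage 2: task adaptation -- fine-tune further on the UERC training set
# (166 subjects), starting from the Multi-PIE-adapted weights.
model = replace_head(model, 166)
# train(model, uerc_train_loader)
```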

While performing fine-tuning, the parameters have been initialized with the values from the pretrained network models. The learning rate of the last fully connected layer has been increased tenfold. This is a commonly used strategy in fine-tuning, since the early layers mainly focus on low-level feature extraction, while the later layers are mainly responsible for classification. The global learning rate is set to 0.0001 for AlexNet [4] and GoogLeNet [6], and 0.001 for VGG-16 [5], during fine-tuning on the Multi-PIE ear and UERC datasets [2]. The learning rate is divided by ten every 20k iterations for AlexNet [4] and VGG-16 [5].
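Expressed with a PyTorch-style optimizer, and continuing the VGG-16 sketch above, this layer-wise learning rate and step decay could look as follows; this is a sketch under the assumption of SGD with momentum, which is not specified in the text.

```python
import torch

# Separate the final fully connected layer from the rest of the network
# so that it can be trained with a ten times larger learning rate.
head_params = list(model.classifier[-1].parameters())
head_ids = {id(p) for p in head_params}
base_params = [p for p in model.parameters() if id(p) not in head_ids]

base_lr = 0.001                      # 0.001 for VGG-16; 0.0001 for AlexNet and GoogLeNet
optimizer = torch.optim.SGD(
    [{"params": base_params, "lr": base_lr},
     {"params": head_params, "lr": 10 * base_lr}],
    lr=base_lr, momentum=0.9)        # momentum value is an assumption

# Divide the learning rate by ten every 20k iterations (AlexNet and VGG-16);
# scheduler.step() is then called once per iteration rather than per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20000, gamma=0.1)
```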

Since alignment is a critical factor in visual recognition tasks, to investigate its impact, we have performed fine-tuning with two different setups. In the first one, both right-side and left-side ear images have been used directly. In the second one, the training data have been aligned to the same direction and fine-tuning has been done with these flipped images; that is, all ear images are mirrored so that they face the same side (either all left ears or all right ears). This setup has been used to reduce the amount of appearance variation within the classes.
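This coarse alignment amounts to mirroring every ear image to one reference side; a minimal sketch is shown below, assuming the left/right label of each ear is known from the annotation or the detector.

```python
import cv2

def align_to_left(image, is_right_ear):
    """Flip right-side ears horizontally so that all ears face the same
    direction; left-side ears are returned unchanged."""
    return cv2.flip(image, 1) if is_right_ear else image
```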

3.3 Data Augmentation

Since the number of images in the UERC dataset [2] is limited, in order to increase the amount of data as well as to account for appearance variations due to image transformations, we have applied data augmentation. Data augmentation has also been applied to the Multi-PIE ear dataset; although it contains around eight times more images than the UERC dataset [2], it still benefits from data augmentation. In this work, data augmentation is performed using the imgaug tool (http://github.com/aleju/imgaug).

For data augmentation, different transformations have been used, and many images have been created from each single image. First, crops at the network input resolution are randomly taken from the resized training images. Then, in the setup without alignment, flipped versions of the images have been produced. Images have been generated at different brightness levels by adding values to or subtracting values from the pixel intensities; these offsets range from -55 to +55 in steps of ten (-55, -45, ..., +45, +55). Brightness has also been modified by multiplying the pixel intensities with a constant, with values from 0.5 to 1.5 in steps of 0.1. To apply Gaussian blur, we have used different sigma values: 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, and 2. Sharpening is applied to each image with values from 0.5 to 2.0 in steps of 0.1 (0.5, 0.6, 0.7, etc.); this parameter adjusts the lightness of the output image. With pixel dropout, some pixels are dropped from the images, creating noisy images that increase the generalization ability of the deep learning model. With contrast normalization, images are created at different contrast levels. Scale, translate, rotate, and shear transformations have been used to further increase image variety. For rotation, angles in the range of -20 to +20 degrees are used with a step size of five degrees. For shear, again with a step size of five degrees, values between -15 and +15 degrees are used. These augmentation parameters are the ones applied to the UERC dataset [2]; for the Multi-PIE ear dataset, a smaller set of parameters was used. After these processes, roughly 220,000 training images for the UERC dataset [2] and around 400,000 training images for the Multi-PIE ear dataset have been obtained.
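With the imgaug tool mentioned above, such an augmentation set could be configured roughly as follows. This is a sketch: continuous ranges approximate the discrete steps described in the text, the dropout, contrast, scale, and translation ranges are assumptions, and how the augmenters were combined in the original experiments may differ.

```python
from imgaug import augmenters as iaa

# One augmenter per transformation family described above; each is applied
# to copies of the training images to enlarge the training set.
augmenters = [
    iaa.Fliplr(1.0),                               # mirrored copies (non-aligned setup only)
    iaa.Add((-55, 55)),                            # additive brightness changes
    iaa.Multiply((0.5, 1.5)),                      # multiplicative brightness changes
    iaa.GaussianBlur(sigma=(0.25, 2.0)),           # Gaussian blur
    iaa.Sharpen(alpha=1.0, lightness=(0.5, 2.0)),  # sharpening (lightness parameter)
    iaa.Dropout(p=(0.01, 0.1)),                    # pixel dropout (rate assumed)
    iaa.ContrastNormalization((0.5, 1.5)),         # contrast changes (range assumed)
    iaa.Affine(rotate=(-20, 20), shear=(-15, 15),
               scale=(0.9, 1.1), translate_percent=(-0.1, 0.1)),  # geometric transforms
]

# training_images is assumed to be a list of HxWxC uint8 arrays:
# augmented = [aug.augment_image(img) for aug in augmenters for img in training_images]
```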

Fig. 4: UERC dataset distribution of number of images with respect to image resolution in (a) training set, (b) test set.
Fig. 5: UERC dataset percentages of ear images with respect to aspect ratio.

3.4 Datasets

3.4.1 Multi-PIE Ear Dataset

The Multi-PIE face dataset contains 337 subjects, whose images were acquired, as the name implies, under different pose, illumination, and expression conditions [14, 15]. Due to the large number of profile and close-to-profile images available in the Multi-PIE dataset, we have utilized it to create an ear dataset, which we have named the Multi-PIE ear dataset (the list of image filenames and corresponding ear bounding boxes is available at https://github.com/irmdgcn/ear_recognition). The view angles selected for ear dataset creation can be seen in Fig. 2. Ear detection has been performed using an ear detection implementation for OpenCV [28]. A sample ear detection output is shown in Fig. 3. Since we have used a generic ear detector, the detection accuracy on the Multi-PIE dataset is not very high, 28.3%; therefore, the ears have been detected successfully only in a subset of the images. Consequently, the new ear dataset that we have obtained from the Multi-PIE face dataset [14, 15] contains around 17,000 ear images of 205 subjects. This ear dataset has been used for domain adaptation for ear recognition.
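For illustration, detecting and cropping ears with OpenCV's cascade classifier interface could look as follows. The cascade file name below is a placeholder; the detector actually used in this work is the OpenCV-based implementation referenced above, not necessarily this cascade or these parameters.

```python
import cv2

# Hypothetical cascade file; substitute the ear cascade shipped with the
# detector implementation referenced in the text.
ear_cascade = cv2.CascadeClassifier("haarcascade_ear.xml")

def crop_ears(profile_image):
    """Detect ear regions in a (close-to-)profile face image and return
    the cropped ear patches."""
    gray = cv2.cvtColor(profile_image, cv2.COLOR_BGR2GRAY)
    boxes = ear_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [profile_image[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```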

3.4.2 UERC Dataset

In the ear recognition field, most datasets have been collected under controlled conditions, such as in a laboratory environment. Unlike these datasets, the UERC dataset [2] has been collected in the wild, that is, it consists of ear images of varying quality collected from the web. Because of this, ear identification on the UERC dataset [2] is a more challenging task. The UERC dataset is divided into training and testing sets. In total, there are 11804 ear images of 3706 subjects. The training part of the UERC dataset contains 2304 images of 166 subjects and the testing part has 9500 images of 3540 subjects. Following the experimental setup in Emeršič et al. [3], only the training part of this dataset has been used for the experiments. Our experimental results for the test part of the UERC dataset can be found in the unconstrained ear recognition challenge summary paper [2]. Briefly, we have proposed two approaches in [2]. The first one was a CNN-based approach utilizing the VGG-16 architecture, which attained a 6.1% rank-1 recognition rate. The second one, which achieved the best score in our experiments with a 6.9% rank-1 recognition rate, was a fusion-based approach combining the scores from the VGG-16 framework with those from the hand-crafted LBP descriptors. The reason for following the experimental setup in [3], instead of the one in [2], is the high number of very low resolution images in the test set, which makes it difficult to interpret the results and analyze the impact of the factors under study. Distributions of the number of samples with respect to image resolution (in terms of the total number of pixels contained in the image) are given in Fig. 4 for the UERC training and testing sets separately. As can be seen, most of the ear images in the testing set of the UERC dataset are of low resolution, with the majority containing fewer than one thousand pixels. The UERC training set has a more even distribution and contains more ear images of better resolution, i.e. having more than ten thousand pixels. The training part of the UERC dataset was created by combining the AWED (1000 images) and CVLED (804 images) datasets with 500 extra images collected from the web [1, 3]. In the rest of the paper, UERC experiments refer to the experiments conducted on the training part of the UERC dataset, as in [3]. We have also analyzed the aspect ratio versus the number of images in the training part of the UERC dataset. As can be seen in Fig. 5, the aspect ratio of the images varies significantly, due to differences in ear shapes and viewing angles, which makes the unconstrained ear recognition problem even more challenging.

Dataset # Images # Subjects
AWE [1] 1000 100
AMI [10] 700 100
WPUT [9] 2071 501
IITD [8] 493 125
CP [7] 102 17
UERC Train [2] 2304 166
Multi-PIE Ear [14, 15] 17183 205
TABLE I: Ear datasets

3.4.3 Other Ear Datasets

There are many other ear datasets that have been collected under controlled conditions, such as the Carreira-Perpinan (CP) [7], Indian Institute of Technology Delhi (IITD) [8], AMI [10], West Pomeranian University of Technology (WPUT) [9], and AWE [1] datasets listed in Table I. The CP dataset [7] contains 102 images belonging to 17 subjects. All ear images in this dataset have been captured from the left side, and there are no accessories or occlusions. The second dataset, IITD [8], contains 493 images of 125 subjects, all of the right ear; accessories are present in this dataset. The third one, AMI [10], contains 700 images of 100 different subjects. Both sides of the ears are available in this dataset, but there are no accessories. Another ear dataset is WPUT [9], which includes 2071 ear images of 501 subjects; accessories are present. The last one is the AWE dataset [1], which is also included in the UERC dataset. It contains 1000 ear images of 100 subjects, which are the first 100 subjects of the UERC dataset [2, 3]. Many previous studies have been conducted on these datasets, and the performance of the proposed approaches on the ones collected under controlled conditions is very high. However, as shown in the unconstrained ear recognition challenge [2], ear recognition in the wild poses several difficulties that cause lower recognition accuracies. In our work, along with the Multi-PIE ear dataset and the UERC dataset, we have utilized these other datasets, especially to investigate whether a dataset bias exists in the ear recognition field. Sample images from these datasets can be seen in Fig. 7.

Name Formula
Basic
d2s
d2sr
avg-diff
diff1
TABLE II: Confidence score calculation formulas

3.5 Fusion

In order to improve the accuracy further, we have utilized model fusion. The classification outputs of different deep CNN models are combined according to their confidence scores for each image. We have employed the different confidence score calculation methods listed in Table II. In the table, the array s contains the prediction percentages obtained by the model in sorted order (largest to smallest), that is, it contains the raw classification scores. The array c contains the confidence scores, which are calculated using the formulas listed in the table. The deep CNN model with the highest confidence score for an image is accepted as the most reliable model for that image. In this work, model combination is applied in the experiments on the UERC dataset [2], where the AlexNet [4], VGG-16 [5], and GoogLeNet [6] models are combined with each other.
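A possible implementation of this confidence-based combination is sketched below. The concrete formulas belong to Table II; the two variants shown (top score and margin between the two highest scores) are common choices used here purely for illustration and are only assumed to correspond to the table's "basic" and "d2s" entries.

```python
import numpy as np

def confidence(scores, method="d2s"):
    """Confidence of one model's prediction from its class scores.
    The exact formulas are those of Table II; the two below are assumptions
    used for illustration only."""
    s = np.sort(scores)[::-1]          # sorted largest to smallest, like the array s
    if method == "basic":
        return s[0]                    # assumed: highest class score
    return s[0] - s[1]                 # assumed "d2s": margin between the top two scores

def fuse(score_arrays):
    """Pick the model with the highest confidence for this image and
    return its predicted class."""
    best_model = max(score_arrays, key=confidence)
    return int(np.argmax(best_model))

# prediction = fuse([alexnet_scores, vgg16_scores, googlenet_scores])
```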

Model | Accuracy | Augmentation | Alignment
AlexNet | 96.71% | + | +
AlexNet | 99.81% | + | -
AlexNet | 97.64% | - | -
VGG-16 | 100% | + | +
VGG-16 | 100% | + | -
VGG-16 | 98.57% | - | -
GoogLeNet | 97.80% | + | +
GoogLeNet | 99.32% | + | -
GoogLeNet | 98.45% | - | -
TABLE III: Multi-PIE ear dataset test results
Model | Accuracy | Fine-Tuning | Aug. | Align.
AlexNet [3] | 49.51% | ImageNet | + | -
VGG-16 [3] | 51.25% | ImageNet | + | -
SqueezeNet [3] | 62.00% | ImageNet | + | -
AlexNet | 49.51% | ImageNet | - | -
AlexNet | 52.00% | ImageNet | + | -
AlexNet | 53.20% | Multi-PIE | - | -
AlexNet | 56.46% | Multi-PIE | + | -
AlexNet | 56.02% | Multi-PIE | + | +
VGG-16 | 51.03% | ImageNet | - | -
VGG-16 | 54.2% | ImageNet | + | -
VGG-16 | 58.84% | Multi-PIE | - | -
VGG-16 | 63.62% | Multi-PIE | + | -
VGG-16 | 62.64% | Multi-PIE | + | +
GoogLeNet | 54.72% | ImageNet | - | -
GoogLeNet | 55.02% | ImageNet | + | -
GoogLeNet | 55.37% | Multi-PIE | - | -
GoogLeNet | 60.91% | Multi-PIE | + | -
GoogLeNet | 60.58% | Multi-PIE | + | +
TABLE IV: UERC dataset test results

4 Experimental Results

We have conducted the ear recognition experiments on the Multi-PIE ear dataset and the UERC dataset [2]. The other ear datasets have been used to assess dataset bias. The Multi-PIE ear dataset is divided into three parts: 80% of the dataset has been used for training, 10% for validation, and the remaining 10% for testing. The experimental setup for the UERC dataset [2] is the same as the one in Emeršič et al. [3]; the dataset has been split into training and testing parts following their protocol. Data augmentation and alignment have been applied on the training parts of the Multi-PIE ear dataset and the UERC dataset [2].

In the experiments, for deep convolutional neural network model training, images have been resized to 256×256 pixels. From these resized images, five different crops are taken during the training phase and a single crop is taken from the center of the image during the test phase. The crop size for the GoogLeNet [6] and VGG-16 [5] models is 224×224 pixels, while for AlexNet [4] it is 227×227 pixels.

4.1 Evaluation on the Multi-PIE Ear Dataset

We have first assessed the performance of the deep CNN models on the collected Multi-PIE ear dataset. The AlexNet [4], VGG-16 [5], and GoogLeNet [6] architectures have been employed and fine-tuned using their pretrained models that were trained on the ImageNet dataset [19]. The obtained results on the test set are listed in Table III. In the table, the first column contains the name of the model, the second the corresponding classification accuracy, and the third and fourth indicate whether augmentation and alignment have been applied. As can be seen, the achieved classification rates are quite high due to the controlled nature of the Multi-PIE ear dataset. The VGG-16 model [5] performs the best. Data augmentation has contributed around 1% to 2% to the accuracy, depending on the model. Alignment did not lead to an improvement. However, this point requires further investigation, since no precise registration of the ear images has been done and they are only aligned roughly to one side.

Fig. 6: UERC dataset test results (a) Sample ear images of different aspect ratios and the corresponding error rates for each aspect ratio interval, (b) Sample ear images of different average intensity values and the corresponding error rates for each average intensity interval.

4.2 Evaluation on the UERC Dataset

For the UERC dataset experiments, we have followed the experimental setup in [3]. As in the experiments on the Multi-PIE ear dataset, AlexNet [4], VGG-16 [5], and GoogLeNet [6] architectures have been employed and fine-tuned using their pretrained models that were trained on the ImageNet dataset [19]. However, this time we have also applied a two-stage fine-tuning as described in Section 3.2, that is we have first fine-tuned the pretrained deep CNN model on the Multi-PIE ear dataset and then fine-tuned the obtained updated model further on the training part of the UERC dataset. The experimental results are given in Table IV. In the table, the first column contains the name of the model, the second one contains the corresponding classification accuracy, the third one shows whether a single or two stage fine-tuning is applied, and the fourth and fifth ones indicate whether augmentation and alignment have been applied or not. For the third column, if the value is ImageNet, then in that experiment only one-stage fine-tuning has been performed and the pretrained model, which was trained on the ImageNet, has been fine-tuned using the training part of the UERC dataset. If the value is Multi-PIE, then two-stage fine-tuning has been applied, first on the Multi-PIE ear dataset, then on the training part of the UERC dataset.

Compared to the results in Table III, the attained performance is significantly lower. Although the number of subjects to classify is smaller in the UERC dataset than in the Multi-PIE ear dataset (166 vs. 205), ear recognition on the UERC dataset is a far more difficult problem due to the challenging appearance variations and low quality images.

The first three rows of Table IV correspond to the experimental results obtained in [3]. In that study, the authors employed AlexNet [4], VGG-16 [5], and SqueezeNet [20], and also utilized data augmentation. Comparing the accuracies obtained with AlexNet [4] and VGG-16 [5] in [3] and in our study under the same setup, that is, with data augmentation and one-stage fine-tuning, it can be seen that our implementation yields a slight improvement. In [3], 49.51% and 51.25% correct classification rates have been achieved using AlexNet [4] and VGG-16 [5], respectively, whereas in our study we have reached accuracies of 52% and 54.2%. This slight increase could be due to differences in the parameters used for data augmentation and in the fine-tuning procedure.

From Table IV, it can be observed that the proposed two-stage fine-tuning procedure results in improved performance. For AlexNet [4], with data augmentation and without alignment, the correct classification rate is increased from 52% to 56.46%. For VGG-16 [5] and GoogLeNet [6], the increase is from 54.2% to 63.62% and from 55.02% to 60.91%, respectively. These significant improvements indicate that domain adaptation is indeed necessary and useful. This finding is in line with the results obtained in [24], where we have shown that when a limited amount of training data is available for a task, it is more useful to transfer a pretrained model that was trained on images from the same domain. For example, for age and gender classification, it is more useful to transfer a pretrained model trained on face images than one trained on generic object images. In summary, compared to the results obtained with the VGG-16 [5] model in [3], we have achieved around a 12% absolute increase in performance (51.25% vs. 63.62%). Similar to the results obtained on the Multi-PIE ear dataset, alignment did not lead to an improvement. Again, it should be noted that no precise registration of the ear images has been done and they are only aligned roughly to one side; therefore, this point requires further investigation. Among the employed models, the VGG-16 model is the best performing one.

We have then fused the individual models in order to improve the performance further. For each model, two-stage fine-tuning has been performed, data augmentation has been applied, and alignment has been omitted. We have utilized the max rule [29] to combine the classification scores and employed the five confidence score calculation schemes (basic, d2s, d2sr, avg-diff, diff1) listed in Table II. The results are given in Table V. The best performance is obtained when combining the two best performing models, that is, VGG-16 [5] and GoogLeNet [6], leading to 67.5% correct classification, which is around 4% higher than that obtained with the single best performing model. No significant performance difference is observed between the employed confidence score calculation methods.

Fig. 7: Sample ear images from the datasets used for dataset identification experiments: (a) Multi-PIE Ear Dataset, (b) AWE, (c) AMI, (d) WPUT, (e) IITD, and (f) CP.
Models Basic d2s d2sr avg-diff diff1
AlexNet + VGG-16 63.95% 64.06% 63.84% 63.95% 64.06%
AlexNet + GoogLeNet 63.51% 64.06% 64.16% 63.51% 63.73%
VGG-16 + GoogLeNet 67.53% 67.31% 67.53% 67.53% 67.42%
All 66.34% 66.01% 65.68% 66.34% 66.23%
TABLE V: UERC dataset fusion results

4.3 Effect of Image Quality on the Performance

The effect of the aspect ratio and illumination conditions of the image on the recognition performance has been analyzed. The results are shown in Fig. 6. As can be seen in Fig. 6(a), different aspect ratios occur due to varying view angles and ear shapes. A low aspect ratio, i.e. between 0 and 1, mainly implies in-plane rotated ear images, while higher aspect ratios, i.e. higher than 2, mainly correspond to out-of-plane view variations. The experimental results show that the ear recognition system performs better when the ear images are cropped from profile faces. Rotations of larger degrees and out-of-plane variations cause a performance drop. Samples of illumination variations from the UERC dataset can be seen in Fig. 6(b). The mean values on the x-axis correspond to the average intensities of the ear images. In dark images, the details of the ear are not visible, causing a loss of information. On the other hand, when the image is very bright, reflections and saturated intensity values are observed. Both of these conditions deteriorate the performance.
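This analysis amounts to grouping the test images by a per-image quality measure (aspect ratio or average intensity) and measuring the error rate within each group; a minimal sketch of such a breakdown is given below, with illustrative bin edges rather than the exact intervals of Fig. 6.

```python
import numpy as np

def error_rate_by_bins(values, correct, edges):
    """Group test images by a per-image quality measure and report the
    error rate inside each bin [lo, hi)."""
    rates = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (values >= lo) & (values < hi)
        if mask.any():
            rates[(lo, hi)] = 1.0 - correct[mask].mean()
    return rates

# values[i]  = aspect ratio (as in Fig. 6a) or mean grey level (as in Fig. 6b)
# correct[i] = 1 if the i-th test image is classified correctly, else 0
# error_rate_by_bins(aspect_ratios, correct, edges=[0, 1, 2, 3, 5])
# error_rate_by_bins(mean_intensities, correct, edges=[0, 64, 128, 192, 256])
```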

4.4 Dataset Identification

During our ear recognition system development and training for the UERC challenge [2], we tried to utilize the previously proposed ear datasets by combining them and using them for training. However, we could not achieve a performance improvement this way. This outcome led us to consider the problem of dataset bias. In order to investigate it, we have designed an experiment in which the class labels of the ear images are the names of the datasets that they belong to. That is, in this experiment, the input to the deep CNN model is an ear image and the classification output is the name of the dataset it belongs to. The goal was to observe whether the deep CNN model can distinguish the differences between the datasets. Six different ear datasets have been used for this experiment, namely the Multi-PIE ear dataset and the AWE [1], AMI [10], WPUT [9], IITD [8], and CP [7] datasets. Sample images from these six datasets can be seen in Fig. 7. In this experiment, the VGG-16 model [5] has been fine-tuned using the training parts of these datasets. The obtained training accuracy was 100%, and the fine-tuned model achieved 99.71% correct classification on the test set. Clearly, the system can easily identify ear images from different datasets. This is a very interesting and important outcome that requires further investigation in future studies.

5 Conclusion

In this study, we have addressed several aspects of ear recognition. First, we have proposed a two-stage fine-tuning strategy for deep convolutional neural networks in order to perform domain adaptation. For this approach, we have first constructed an ear dataset from the Multi-PIE face dataset [14, 15], which we have named the Multi-PIE ear dataset. In the first stage, we have fine-tuned the pretrained deep CNN models, which were trained on ImageNet, on this newly collected dataset. This provides domain adaptation for the pretrained deep CNN models. In the second stage, we perform fine-tuning on the target dataset, which is the UERC dataset [2] in this work. This second stage provides a more specific domain and/or dataset adaptation. This step is also very crucial, since, as we have shown in the experiments, there exists a dataset bias [27] among the ear recognition datasets. We have also combined the deep CNN models to improve the performance further. In addition, we have analyzed in depth the effect of ear image quality, namely intensity level and aspect ratio, on the classification performance.

We have conducted extensive experiments on the UERC dataset [2] and have shown that performing two-stage fine-tuning is very beneficial for ear recognition. With data augmentation and without alignment, for AlexNet [4], the correct classification rate is increased from 52% to 56.46%. For VGG-16 [5] and GoogLeNet [6], the increase is from 54.2% to 63.62% and from 55.02% to 60.91%, respectively. This consistent improvement indicates the importance of transferring a pretrained CNN model from a closer domain. It has been observed that combining different deep convolutional neural network models leads to a further improvement in performance. We have achieved the best performance by combining the two best performing models, that is, VGG-16 [5] and GoogLeNet [6], leading to 67.5% correct classification, which is around 4% higher than that obtained with the single best performing model. We have noticed that performing alignment did not improve the performance. However, this point requires further investigation, since the ear images have not been precisely registered and have only been coarsely aligned by flipping them to one side. The effects of different aspect ratios, which result from varying view angles and ear shapes, and of illumination conditions have also been studied. The ear recognition system performs better when the ear images are cropped from profile faces, whereas very dark and very bright illumination causes missing details and reflections, resulting in performance deterioration. Finally, we have conducted experiments to examine the dataset bias. Given an ear image as input, we were able to classify the dataset that it comes from with 99.71% accuracy, which indicates a strong bias among the ear recognition datasets. For future work, we plan to address automatic ear detection, precise ear alignment, and dataset bias, which are important research problems in the ear recognition field.

Acknowledgments

This work was supported by the Istanbul Technical University Research Fund, ITU BAP, project no. 40893.

References

  • [1] Emeršič, Ž., Štruc, V., Peer, P.: ’Ear recognition: More than a survey’, Neurocomputing, 2017, 255, pp. 26-39
  • [2] Emeršič, Ž., Štepec, D., Štruc, V., Peer, P., George, A., Ahmad, A., Omar, E., Boult, T. E., Safdari, R., Zhou, Y., Zafeiriou, S., Yaman, D., Eyiokur, F. I., Ekenel, H. K.: ’The unconstrained ear recognition challenge’, International Joint Conference on Biometrics (IJCB), 2017
  • [3] Emeršič, Ž., Štepec, D., Štruc, V., Peer, P.: ’Training convolutional neural networks with limited training data for ear recognition in the wild’, Automatic Face & Gesture Recognition (FG), 2017, pp. 987-994
  • [4] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ’ImageNet classification with deep convolutional neural networks’, Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097-1105
  • [5] Simonyan, K., Zisserman, A.: ’Very deep convolutional networks for large-scale image recognition’, International Conference on Learning Representations (ICLR), 2015
  • [6] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: ’Going deeper with convolutions’, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9
  • [7] Carreira-Perpinan, M.A.: ’Compression neural networks for feature extraction: Application to human recognition from ear images’, Master’s thesis, Faculty of Informatics, Technical University of Madrid, Spain, 1995
  • [8] Kumar, A., Wu, C.: ’Automated human identification using ear imaging’, Pattern Recognition, 45, (3), 2012, pp. 956-968
  • [9] Frejlichowski, D., Tyszkiewicz, N.: ’The west pomeranian university of technology ear database-a tool for testing biometric algorithms’, Image Analysis and Recognition, 2010, pp. 227-234
  • [10] González-Sánchez, E.: ’Biometria de la oreja’, Ph.D. thesis, Universidad de Las Palmas de Gran Canaria, Spain, 2008
  • [11] Hurley, D.J., Nixon, M.S., Carter, J.N.: ’Ear biometrics by force field convergence’, International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), 2005, pp. 386-394
  • [12] Prakash, S., Gupta, P.: ’An efficient ear recognition technique invariant to illumination and pose’, Telecommunication Systems, 2013, 52, (3), pp. 1435-1448
  • [13] Wang, Z.Q., Yan, X.D.: ’Multi-scale feature extraction algorithm of ear image’, IEEE International Conference on Electric Information and Control Engineering (ICEICE), 2011, pp. 528-531
  • [14] Gross, R., Matthews, I., Cohn, J.F., Kanade, T., Baker, S.: ’Multi-PIE’, IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2008
  • [15] Gross, R., Matthews, I., Cohn, J. F., Kanade, T., Baker, S.: ’Multi-PIE’, Image and Vision Computing, 2010, pp. 807-813
  • [16] Ioffe, S., Szegedy, C.: ’Batch normalization: Accelerating deep network training by reducing internal covariate shift’, International Conference on Machine Learning, 2015, pp. 448-456
  • [17] Russakovsky, O., Deng, J., Su, H., et al.: ’ImageNet large scale visual recognition challenge’, International Journal of Computer Vision, 2015, 115.3, pp. 211-252
  • [18] Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: ’Dropout: a simple way to prevent neural networks from overfitting’, Journal of Machine Learning Research, 2014, 15.1, pp. 1929-1958
  • [19] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ’ImageNet: A large-scale hierarchical image database’, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255
  • [20] Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: ’SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size’, arXiv preprint arXiv:1602.07360, 2016
  • [21] Guo, Y., Xu, Z.: ’Ear recognition using a new local matching approach’, IEEE International Conference on Image Processing (ICIP), 2008, pp. 289-292
  • [22] ’Introduction to USTB ear image databases’, http://www1.ustb.edu.cn/resb/en/index.htm, accessed September 2017
  • [23] Galdámez, P.L., Raveane, W., Arrieta, A.G.: ’A brief review of the ear recognition process using deep neural networks’, Journal of Applied Logic, 2016
  • [24] Ozbulak, G., Aytar, Y., Ekenel, H.K.: ’How transferable are CNN-based features for age and gender classification?’, IEEE International Conference of Biometrics Special Interest Group (BIOSIG), 2016, pp. 1-6
  • [25] Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: ’How transferable are features in deep neural networks?’, Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3320-3328.
  • [26] LeCun, Y., Bengio, Y., Hinton, G.:’Deep learning’, Nature, 2015, 7553, (521), pp. 436-444
  • [27] Torralba, A., Efros, A.A: ’Unbiased look at dataset bias’, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1521-1528
  • [28] ’Open Source Computer Vision Library’, https://opencv.org/, accessed September 2017
  • [29] Kittler, J., Hatef, M., Duin, R.P., Matas, J.: ’On combining classifiers’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20, (3), pp. 226-239
  • [30] Vu, N., Caplier, A.: ’Face recognition with patterns of oriented edge magnitudes’, Computer Vision (ECCV), 2010, pp. 313-326.