Neonatal Pain Expression Recognition Using Transfer Learning
Transfer learning using pre-trained Convolutional Neural Networks (CNNs) has been successfully applied to images for different classification tasks. In this paper, we propose a new pipeline for pain expression recognition in neonates using transfer learning. Specifically, we propose to exploit a pre-trained CNN that was originally trained on a relatively similar dataset for face recognition (i.e., VGG-Face) as well as CNNs that were pre-trained on a relatively different dataset for image classification (i.e., VGG-F,M,S) to extract deep features from neonates’ faces. In the final stage, several supervised machine learning classifiers are trained to classify neonates’ facial expressions as pain or no-pain expressions. The proposed pipeline achieved, on a testing dataset, 0.841 AUC and 90.34% accuracy, which is approx. 7% higher than the accuracy of traditional handcrafted features. We also propose to combine deep features with traditional features and hypothesize that the mixed features would improve pain classification performance. Combining deep features with traditional features achieved 92.71% accuracy and 0.948 AUC. These results show that transfer learning, which is a faster and more practical option than training a CNN from scratch, can be used to extract useful features for pain expression recognition in neonates. They also show that combining deep features with traditional handcrafted features is a good practice to improve the performance of pain expression recognition and possibly the performance of similar applications.
Infants receiving care in the Neonatal Intensive Care Unit (NICU) might experience up to several hundred painful procedures during their stay. Pediatric studies have reported several long-term outcomes of repeated pain exposure in early life. For instance, it has been found that repeated painful experience in neonates is associated with alterations in the cerebral white matter and subcortical grey matter and delayed cortico-spinal development. These alterations in neurodevelopment can result in a variety of behavioral, developmental, and learning disabilities. Other long-term outcomes of pain exposure that are reported at school age include delayed visual-perceptual development, lower IQs, and internalizing behavior.
The recognition of the adverse outcomes associated with neonatal pain exposure has led to the recommendation of using opioids such as Fentanyl and Morphine. Although analgesic medications can reduce the consequences of neonatal pain exposure, recent studies found a link between the excessive use of these medications and many short- and long-term side effects. Zwicker et al. found that a 10-fold increase in Morphine, an agent commonly used for neonatal pain management, is associated with impaired cerebellar growth in the neonatal period and poorer neurodevelopmental outcomes in early childhood. The long-term side effects of another well-known analgesic medication (i.e., Fentanyl) were discussed in . This study described Fentanyl as an extremely potent analgesic and listed several side effects of high doses of Fentanyl, such as neuroexcitation and respiratory depression.
These results suggest that the failure to recognize and treat pain when needed (i.e., under-treatment) as well as the administration of analgesic medications in the absence of pain (i.e., over-treatment) can cause serious outcomes and permanently change the brain’s structure and function. The annual cost of care related to adverse neurodevelopmental outcomes in preterm infants alone is estimated at over 7 billion dollars.
Because pain assessment is the cornerstone of pain management, the assessment of neonatal pain should be accurate and continuous. Currently, caregivers assess neonatal pain by observing behavioral (e.g., facial expression and crying) and physiological (e.g., vital signs changes) indicators using multidimensional pain scales such as NIPS (Neonatal Infant Pain Scale), FLACC (Face, Legs, Activity, Crying, and Consolability), and NFCS (Neonatal Facial Coding System). This practice is inconsistent because it is highly susceptible to observer bias. Additionally, it is discontinuous and requires a large number of well-trained nurses to ensure proper utilization of the tools. The discontinuous nature of the current practice as well as the inter-rater variation may result in delayed intervention and inconsistent treatment of pain. Therefore, developing automated and continuous tools that can generate immediate and more consistent pain assessment is crucial.
2 Existing Work and Contribution
Recent innovations in computer vision have facilitated the development of automated approaches that continuously and consistently assess pain. A large body of methods has been proposed to automatically assess pain using behavioral (e.g., facial expression [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32] and crying [33, 34]) or physiological (e.g., changes in vital signs [35, 36] and cerebral hemodynamic changes [37, 38]) indicators. The vast majority of these methods assess and estimate pain based on analysis of facial expression. This focus is due to the fact that facial expression is the most common and specific indicator of pain. As such, most pediatric pain scales [7, 8, 9] include facial expression as a main indicator for pain assessment.
Of the existing methods for automatic pain expression analysis, only a few [29, 31, 32] focused on neonatal pain. This can be attributed to the lack of publicly available neonatal datasets. Another reason is the common belief that algorithms designed for adults should have similar performance when applied to neonates. Contrary to this belief, we think the methods designed for assessing adults’ pain will not have similar performance and might completely fail for two main reasons. First, facial morphology and dynamics vary between infants and adults, as reported in . Moreover, infants’ facial expressions include additional movements and units that are not present in the Facial Action Coding System (FACS). As such, Neonatal FACS was introduced as an extension of FACS [40, 9]. Second, we think the preprocessing stage (e.g., face tracking) is more challenging in the case of infants because they are uncooperative subjects recorded in an unconstrained environment.
The methods of automatic recognition of neonatal pain expression can be divided into two main categories: static and dynamic methods.
Static methods extract pain-relevant features from static images and use the extracted features to train off-the-shelf classifiers. One of the first works to detect and classify pain expression from infants’ images (COPE dataset) is presented in . The proposed method takes a static image as input and concatenates its pixels into a feature vector with values ranging from 0 to 255. Then, Principal Component Analysis (PCA) was applied to reduce the vector’s dimensionality. For classification, distance-based classifiers and Support Vector Machines (SVMs) were used to classify the images into one of the following four pairs: pain/no-pain, pain/cry, pain/air puff, and pain/friction. The results showed that SVMs evaluated using 10-fold cross-validation achieved the best recognition rate and outperformed distance-based classifiers in classifying pain versus no-pain (88.00%), pain versus rest (94.62%), pain versus cry (80.00%), pain versus air-puff (83.33%), and pain versus friction (93.00%). This work was extended by employing Sequential Floating Forward Selection for feature selection and the Neural Network Simultaneous Optimization Algorithm (NNSOA) for classification, and an average classification rate of 90.2% was obtained. Nanni et al. applied several variations of Local Binary Pattern (LBP) to static images of the COPE dataset to classify them into pain and no-pain expressions. These variations include Local Ternary Pattern (LTP), Elongated Local Ternary Pattern (ELTP), and Elongated Local Binary Pattern (ELBP). The highest performance was achieved by ELTP with an AUC (Area Under the Receiver Operating Characteristic Curve) score of 0.93. A complete review of the existing methods for pain expression recognition can be found in .
The above-listed works utilize traditional handcrafted features for classification. Recently, deep features extracted from Convolutional Neural Networks (CNNs) have shown good performance in several classification tasks. The main difference between handcrafted features and deep features is that the features extracted by a CNN are learned, at multiple levels of abstraction, directly from the data, whereas handcrafted features are designed beforehand by human experts to extract a given set of chosen characteristics.
This paper contributes a novel pipeline to recognize pain expression in neonates using transfer learning. Specifically, we propose to use four pre-trained CNN architectures, namely VGG-F, VGG-M, VGG-S, and VGG-Face, and show that these pre-trained CNNs can be used to extract useful features for pain expression classification in neonates. The VGG-F,M,S architectures were originally trained on the ImageNet dataset (approx. 1.2M images and 1000 classes) for image classification, while VGG-Face was trained on a large face dataset (approx. 2.6M face images of 2622 identities) for face recognition. We hypothesize that the architectures that were originally trained on ImageNet for image classification can be used to extract useful features for pain classification. We also hypothesize that VGG-Face can be used for pain classification and that it would provide better performance than the first three architectures because it was pre-trained on a relatively similar dataset. The reason for choosing an architecture trained to recognize faces instead of emotions is that face recognition is well studied and validated on large-volume datasets as compared to emotion classification. Moreover, the features of face recognition and facial expression recognition are rather similar since both tasks involve analyzing human faces [alexandr2017group].
In addition, this paper proposes a new approach to pain-emotion analysis that incorporates both deep features and traditional handcrafted features. We hypothesize that the mixed features can improve pain classification performance.
Organization: Section 3 describes the infants’ pain dataset utilized in this work. Section 4 presents the preprocessing stage of our proposed pipeline, provides a brief introduction to Convolutional Neural Networks, and discusses how we used transfer learning for pain classification. Section 5 presents the experiments we designed to evaluate our hypotheses and summarizes the results. We conclude in Section 6.
3 Infants Pain Assessment Database
Infants (N = 31 infants, 16 female and 15 male) were recorded undergoing a brief acute stimulus such as heel lancing or immunization during their hospitalization in the NICU at Tampa General Hospital. Infants’ average gestational age is 36.4 weeks, ranging from 30.4 to 40.6 (SD = 2.7). The ethnic distribution was 17% Caucasian, 47% White, 17% African American, 12% Asian, and 7% other. Any infant born between 28 and 41 weeks of gestation was eligible for enrollment after obtaining informed consent from the parents. We excluded infants with craniofacial abnormalities and neuromuscular disorders.
3.2 Video Recording and Ground Truth Labeling
We used a GoPro Hero3+ camera to record infants’ facial expression, body movement, and crying sound. All the recordings were carried out in the normal clinical environment, which was modified only by the addition of the camera.
We recorded each infant in seven time periods: 1) prior to the painful procedure, to get the baseline state observation; 2) the procedure preparation period, which begins with first touch, may include positioning or skin preparation, and ends with skin breaking; 3) the painful procedure, which lasts the duration of the procedure; 4) one minute after the completion of the painful procedure; 5) two minutes after the completion; 6) three minutes after the completion; and 7) the recovery period, five minutes after the procedure. Each time period was observed by trained nurses to provide the pain assessment using NIPS (Neonatal Infant Pain Scale).
The NIPS scale consists of six elements: facial expression, crying, body movement (i.e., arms and legs, scored separately), state of arousal, and breathing patterns. Each element of the NIPS is scored on a scale of 0-1 with the exception of cry, which is scored on a scale of 0-1-2. A total score of 3-4 represents moderate pain and a score above 4 indicates severe pain. To get the ground truth for each video epoch, we thresholded the total score (i.e., severe pain, moderate pain, or no pain) and used the result as the label for algorithm evaluation. In this paper, we included pain/no-pain labels and excluded moderate pain because the number of epochs for moderate pain in the current dataset is small.
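The labeling rule described above can be sketched as a small function. This is only an illustration: the dictionary keys and function names are hypothetical, and the cut-offs assume that a total of 0-2 means no pain, 3-4 moderate pain, and anything above 4 severe pain.

```python
# Maximum score per NIPS element (cry is the only element scored 0-1-2).
# Key names are hypothetical; they are not taken from the paper.
NIPS_MAX = {
    "facial_expression": 1,
    "cry": 2,
    "breathing_pattern": 1,
    "arms": 1,
    "legs": 1,
    "state_of_arousal": 1,
}


def nips_total(scores):
    """Sum the six NIPS element scores after validating their ranges."""
    total = 0
    for element, max_score in NIPS_MAX.items():
        value = scores[element]
        if not 0 <= value <= max_score:
            raise ValueError(f"{element} must be in 0..{max_score}, got {value}")
        total += value
    return total


def nips_label(total):
    """Threshold the total score (assumed cut-offs: 0-2, 3-4, above 4)."""
    if total > 4:
        return "severe pain"
    if total >= 3:
        return "moderate pain"
    return "no pain"


def binary_label(total):
    """Pain/no-pain label; moderate-pain epochs are excluded (None)."""
    label = nips_label(total)
    if label == "moderate pain":
        return None
    return "pain" if label == "severe pain" else "no pain"


example = {
    "facial_expression": 1,
    "cry": 2,
    "breathing_pattern": 1,
    "arms": 1,
    "legs": 1,
    "state_of_arousal": 0,
}
example_total = nips_total(example)
example_label = binary_label(example_total)
```

Returning `None` for moderate pain mirrors the decision to drop those epochs rather than fold them into either class.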
4 Automatic Pain Expression Recognition
The proposed pain expression recognition pipeline consists of three main stages: 1) face detection and preprocessing, 2) deep feature extraction using transfer learning, and 3) feature selection and classification. We describe each stage in detail below.
4.1 Automatic Face Detection and Preprocessing
We applied ZFace, which is a person-independent tracker, to each video to detect the face and obtain 49 facial landmark points. The tracker outputs the 49 points’ coordinates as well as a failure message to indicate the failure frames; those frames were excluded from further analysis. For each frame, we used the detected points to register and crop the infant’s exact face region. We applied the tracker to 200 videos to detect the face and the landmark points. Then, we selected the key frames from each video, thereby removing many similar frames. The selected frames from all videos (i.e., 3026 frames) were then resized to 224 x 224 to meet the CNNs’ input size requirement (224 x 224 x 3, RGB images).
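The cropping and resizing steps can be illustrated with a dependency-free sketch. It is not the ZFace pipeline: registration (alignment) is omitted, the crop is assumed to be the axis-aligned bounding box of the landmarks, and nearest-neighbour interpolation is an arbitrary choice. All function names are hypothetical.

```python
def landmark_bounding_box(landmarks, width, height, margin=0):
    """Return (left, top, right, bottom) enclosing all (x, y) landmarks."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    left = max(0, int(min(xs)) - margin)
    top = max(0, int(min(ys)) - margin)
    right = min(width, int(max(xs)) + 1 + margin)
    bottom = min(height, int(max(ys)) + 1 + margin)
    return left, top, right, bottom


def crop(image, box):
    """Crop an image stored as a list of rows of pixels."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image, out_h=224, out_w=224):
    """Resize with nearest-neighbour sampling (an assumed interpolation)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Synthetic 64 x 48 RGB frame whose pixel values encode their coordinates.
frame = [[(x, y, 0) for x in range(64)] for y in range(48)]
points = [(20.5, 10.2), (40.0, 30.9), (30.0, 12.0)]  # stand-in for 49 landmarks
box = landmark_bounding_box(points, width=64, height=48)
face = resize_nearest(crop(frame, box))
```

In practice an image library would handle interpolation; the point here is only the order of operations: landmarks, bounding box, crop, then resize to the network input size.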
4.2 Deep Features Extraction
In this section, we give a brief introduction to Convolutional Neural Networks (CNNs) and Transfer Learning. We also describe the pipeline we propose to extract useful features for pain classification.
4.2.1 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) have gained a lot of popularity in the last decade due to their wide range of successful applications in natural language processing, recommender systems, medical image analysis, object recognition, and emotion analysis. The power of CNNs, which are biologically inspired variants of multilayer perceptrons, can be attributed to their deep architecture, which allows them to extract a set of features (i.e., features independent of prior knowledge or human effort) at multiple levels of abstraction.
A CNN consists of an input layer, an output layer, and three types of hidden layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layer applies k convolutional kernels (filters) to the input and passes the result (i.e., the feature maps) to the next layer. This layer takes as input an m × m × r image, where m represents the image’s width and height and r represents the number of channels (e.g., 3 channels for RGB); each filter of this layer has spatial size n × n, where n < m. The pooling layer takes each feature map as input and performs subsampling by taking the average or the maximum to create a new subsampled feature map. The last type of hidden layer is the fully connected layer, which is a regular feed-forward neural network layer that computes the activation of each class; this layer is responsible for the high-level reasoning in the network.
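To make the convolution and pooling operations concrete, the following is a minimal single-channel sketch in plain Python. It is an illustration under simplifying assumptions (one square input channel, one filter, zero padding), not how MatConvNet or any other library implements these layers. With stride s and padding p, the output side length is (m + 2p - n) / s + 1, rounded down.

```python
def conv2d(image, kernel, stride=1, pad=0):
    """2-D cross-correlation (what CNN libraries call convolution)
    of a square single-channel image, with optional zero padding."""
    m, n = len(image), len(kernel)
    if pad:
        size = m + 2 * pad
        padded = [[0.0] * size for _ in range(size)]
        for i in range(m):
            for j in range(m):
                padded[i + pad][j + pad] = image[i][j]
        image, m = padded, size
    out = (m - n) // stride + 1
    return [
        [
            sum(
                image[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(n)
                for b in range(n)
            )
            for j in range(out)
        ]
        for i in range(out)
    ]


def relu(feature_map):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows."""
    out = len(feature_map) // size
    return [
        [
            max(
                feature_map[i * size + a][j * size + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out)
        ]
        for i in range(out)
    ]


# 6 x 6 image with a vertical edge, and a 3 x 3 vertical-edge filter.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]
feature_map = conv2d(image, kernel)      # 4 x 4 feature map
pooled = max_pool(relu(feature_map), 2)  # 2 x 2 subsampled map
```

The filter responds only where the window straddles the edge, and pooling keeps the strongest response in each 2 x 2 window.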
In practice, it is more common to use a pre-trained CNN as a fixed feature extractor or as a starting point (i.e., fine-tuning the weights of the pre-trained CNN) instead of training the network from scratch, for two main reasons. First, it is relatively rare to find a labeled dataset that is large enough (e.g., ImageNet, with approx. 1.2 million images and 1000 classes) to train CNNs from scratch. The vast majority of the existing datasets, especially in the medical domain for the neonatal population, are small. In fact, we are not aware of any publicly available neonatal dataset collected for pain assessment or a similar application, except the small COPE dataset (204 face images). Second, training CNNs requires extensive computational and memory resources as well as patience and expertise to ensure the proper choice of architecture and learning parameters. Transfer learning is an alternative to training a CNN from scratch that has received much attention in machine learning research and practice [44, 45, 46]. Andrew Ng described transfer learning as the next driver of machine learning commercial success after supervised learning.
Table 1: VGG-F architecture.
|Layer|Configuration|
|Conv 1|st. 4, pad 0|
|Conv 2|st. 1, pad 2|
|Conv 3|st. 1, pad 1|
|Conv 4|st. 1, pad 1|
|Conv 5|st. 1, pad 1|
|Full 6|4096 dropout|
|Full 7|4096 dropout|
|Full 8|1000 softmax|

Table 2: VGG-M architecture.
|Layer|Configuration|
|Conv 1|st. 2, pad 0|
|Conv 2|st. 2, pad 1|
|Conv 3|st. 1, pad 1|
|Conv 4|st. 1, pad 1|
|Conv 5|st. 1, pad 1|
|Full 6|4096 dropout|
|Full 7|4096 dropout|
|Full 8|1000 softmax|

Table 3: VGG-S architecture.
|Layer|Configuration|
|Conv 1|st. 2, pad 0|
|Conv 2|st. 1, pad 1|
|Conv 3|st. 1, pad 1|
|Conv 4|st. 1, pad 1|
|Conv 5|st. 1, pad 1|
|Full 6|4096 dropout|
|Full 7|4096 dropout|
|Full 8|1000 softmax|
4.2.2 Transfer Learning
Transfer learning applies knowledge learned in a previous domain or task to a new, related domain or task. It offers an attractive solution to the lack of large annotated datasets, which is known to be common in medical applications. The idea of transfer learning is inspired by human learning and the fact that people can intelligently learn or solve a new problem using previously learned knowledge.
There are two main scenarios for transfer learning: fine-tuning and fixed feature extraction. The first scenario involves fine-tuning the weights of the higher layers in the pre-trained CNN by continuing backpropagation, since these layers contain dataset-specific features while the lower layers contain generic features (e.g., edge detectors and color). In the second scenario, the pre-trained CNN is used as a fixed feature extractor to extract deep features after removing the output layer. The extracted features are then used to train supervised machine learning classifiers (e.g., SVM) for a new task.
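The fixed-feature-extractor scenario can be sketched with a toy network. Everything here is a stand-in: the two-layer "pre-trained" network and its weights are made up, and a nearest-centroid classifier is used instead of an SVM purely to keep the example dependency-free.

```python
def dense(weights, bias):
    """Build a fully connected layer with fixed (frozen) weights."""
    def layer(v):
        return [sum(w * x for w, x in zip(row, v)) + b
                for row, b in zip(weights, bias)]
    return layer


def relu(v):
    return [max(0.0, x) for x in v]


# Toy stand-in for a pre-trained network; all weights are made up.
PRETRAINED = [
    dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    relu,
    dense([[1.0, 1.0]], [0.0]),  # original output layer (removed below)
]


def extract_features(network, x):
    """Run every layer except the last one; weights are never updated."""
    for layer in network[:-1]:
        x = layer(x)
    return x


def fit_centroids(features, labels):
    """Train a nearest-centroid classifier on the extracted features."""
    groups = {}
    for f, label in zip(features, labels):
        groups.setdefault(label, []).append(f)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, feature):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(feature, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


inputs = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = ["pain", "pain", "no pain", "no pain"]
features = [extract_features(PRETRAINED, x) for x in inputs]
centroids = fit_centroids(features, labels)
prediction = predict(centroids, extract_features(PRETRAINED, [4.0, 1.0]))
```

Only the classifier on top is trained; fine-tuning would instead continue backpropagation through the higher layers of the network itself.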
In this paper, we propose a pipeline for neonatal pain expression recognition using the second scenario of transfer learning. Specifically, we used four pre-trained CNNs to extract deep features from our relatively small dataset (31 subjects, 3026 images). The first three CNN architectures, which are VGG-F, VGG-M, and VGG-S, were previously trained on a subset of the ImageNet dataset (approx. 1.2M images and 1000 classes) for image classification. Tables 1-3 provide the architectures for VGG-F, VGG-M, and VGG-S, respectively. The fourth CNN architecture (depicted in Table 4) is the VGG-Face descriptor, which was previously trained on a large face dataset (approx. 2.6M face images of 2622 identities) for face recognition. Choosing these pre-trained CNNs allows us to investigate the difference between using CNNs trained on a relatively similar dataset (i.e., VGG-Face, face dataset) and CNNs trained on a relatively different dataset (i.e., VGG-F,M,S, ImageNet). We hypothesize that these pre-trained CNN architectures can be used to extract useful texture features for pain classification.
For all four pre-trained architectures, we extracted deep features from the last fully connected layer before the output layer (Full 7 in Tables 1-4), which has high-level features that are more specific to the dataset used for training. We also extracted features from the last convolutional layer (Conv 5 in Tables 1-4), which has lower-level, more generic features.
Table 4: VGG-Face architecture.
|Layer|Configuration|
|Conv 1-1|st. 1, pad 1|
|Conv 1-2|st. 1, pad 1|
|Conv 2-1|st. 1, pad 1|
|Conv 2-2|st. 1, pad 1|
|Conv 3-1|st. 1, pad 1|
|Conv 3-2|st. 1, pad 1|
|Conv 3-3|st. 1, pad 1|
|Conv 4-1|st. 1, pad 1|
|Conv 4-2|st. 1, pad 1|
|Conv 4-3|st. 1, pad 1|
|Conv 5-1|st. 1, pad 1|
|Conv 5-2|st. 1, pad 1|
|Conv 5-3|st. 1, pad 1|
|Full 6|4096 dropout|
|Full 7|4096 dropout|
4.3 Feature Selection and Classification
The deep feature vector extracted using transfer learning is high-dimensional, and hence feature selection was necessary. In this section, we briefly present two feature selection methods as well as four machine learning classifiers.
4.3.1 Feature Selection
Feature selection methods aim to select, from a given feature vector, the most relevant features while discarding irrelevant or redundant features. In this paper, we used Relief-f and Symmetric Uncertainty methods.
The Relief-f method uses a nearest neighbor algorithm to find, for each instance, a neighbor from the same class (i.e., nearest hit) and a neighbor from the opposite class (i.e., nearest miss). It then ranks features according to their weights, which increase or decrease as a function of how well each feature distinguishes between the classes. In our experiments, we used the best 5, 10, and 15 features for classification.
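The weight update can be illustrated with the original two-class Relief algorithm, which uses a single nearest hit and nearest miss per instance. This is a simplified sketch, not the Weka Relief-f implementation used in the experiments (Relief-f averages over several neighbors and handles multiple classes); it also assumes every class has at least two instances.

```python
def relief_weights(X, y):
    """Basic two-class Relief: reward features that differ on the nearest
    miss and penalise features that differ on the nearest hit."""
    n, d = len(X), len(X[0])
    lo = [min(row[f] for row in X) for f in range(d)]
    hi = [max(row[f] for row in X) for f in range(d)]
    span = [(hi[f] - lo[f]) or 1.0 for f in range(d)]  # avoid division by 0

    def diff(f, a, b):
        return abs(a[f] - b[f]) / span[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights


def top_k(weights, k):
    """Indices of the k highest-weighted features."""
    return sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)[:k]


# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
selected = top_k(weights, 1)
```

On this toy data the discriminative feature receives a positive weight and the noisy one a negative weight, so `top_k` keeps feature 0.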
Symmetric uncertainty is a feature selection method that measures the correlation between features. It selects features based on the hypothesis that “good feature subsets contain features highly correlated with the class, yet uncorrelated with each other” . It is computed as follows:
$$SU(X, Y) = 2 \, \frac{H(X) - H(X \mid Y)}{H(X) + H(Y)}$$
where X and Y represent two features, H(X) and H(Y) represent the entropy of these features, and H(X | Y) is the conditional entropy of X given Y. This symmetric measure ranges from 0 (uncorrelated) to 1 (correlated). We used the best 5, 10, and 15 features found by the algorithm for classification.
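A small sketch of the measure for discrete variables, using the standard definition SU(X, Y) = 2 [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)], where the numerator is the information gain. Continuous deep features would first need to be discretized; that step is not shown, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Symmetric uncertainty between two discrete variables, in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    info_gain = hx + hy - hxy       # equals H(X) - H(X | Y)
    return 2.0 * info_gain / (hx + hy)


su_identical = symmetric_uncertainty([0, 0, 1, 1], [0, 0, 1, 1])
su_independent = symmetric_uncertainty([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical variables give the maximum value and independent ones give zero, matching the stated range of the measure.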
There exists a wide range of classification algorithms, each with its own strengths and weaknesses. In this work, we experimented with Naive Bayes, Nearest Neighbors (kNN), Support Vector Machines (SVMs), and Random Forests (RF) classifiers because they have shown good classification performance in transfer learning applications [44, 51, 52, 53, 53]. A brief overview of these classifiers is presented below.
Naive Bayes is a simple yet efficient probabilistic classifier based on Bayes’ theorem. It simplifies learning because it does not require iterative parameter estimation and assumes that the features are conditionally independent given the class variable. The learning phase of this classifier involves estimating the conditional and prior probabilities for each class. To classify a new instance, Naive Bayes computes the posterior probability of each class given the feature values of this instance, and assigns the instance to the class that has the highest probability.
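As an illustration, here is a compact Gaussian Naive Bayes for numeric features. Modeling each feature with a per-class normal distribution is an assumption of this sketch, as is the small variance floor; the paper does not state which variant was used.

```python
from math import log, pi


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the prior, per-feature means, and variances for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n, d = len(rows), len(rows[0])
        means = [sum(r[f] for r in rows) / n for f in range(d)]
        variances = [
            sum((r[f] - means[f]) ** 2 for r in rows) / n + var_floor
            for f in range(d)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest (log) posterior probability."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = log(prior)
        for value, mu, var in zip(x, means, variances):
            # Log of the normal density; features are treated as independent.
            score += -0.5 * log(2 * pi * var) - (value - mu) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
y = ["no pain", "no pain", "pain", "pain"]
model = fit_gaussian_nb(X, y)
first = predict_gaussian_nb(model, [1.1, 2.1])
second = predict_gaussian_nb(model, [3.1, 3.9])
```

Training is a single pass over the data, which is the "no iterative parameter estimation" property mentioned above.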
Nearest neighbor is a non-parametric machine learning algorithm that classifies a new instance based on the class of its nearest instances (i.e., neighbors). The classification phase for kNN is delayed to run-time; hence it is also known as a lazy classifier. To find the neighbors of a new instance, a distance metric (e.g., Euclidean distance) is computed between the given instance and the training instances, and the k closest instances are taken as its neighbors. Then, majority voting is applied to the k neighbors to choose the most common class as the class of the new instance. This algorithm is simple and effective, but requires a large memory space because it needs all the instances to be in memory at run-time.
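A minimal kNN sketch with Euclidean distance and majority voting follows; the function name and the default k are illustrative choices, not values from the experiments.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training instances."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: dist(pair[0], x)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
train_y = ["no pain", "no pain", "no pain", "pain", "pain"]
near_origin = knn_predict(train_X, train_y, [0.05, 0.05])
near_one = knn_predict(train_X, train_y, [1.0, 1.05], k=1)
```

All training instances are kept and scanned at prediction time, which is why the method is memory-hungry.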
An SVM is a supervised classifier that finds the optimal separating hyperplane, i.e., the one that maximizes the margin (the distance between the hyperplane and the closest points of the two classes). These closest points are the support vectors; removing them would change the position of the hyperplane. The mathematical formulation of SVMs and further discussion can be found in .
Random forest is a supervised classification algorithm that constructs, at training time, an ensemble of decision trees (i.e., a forest of trees) and outputs the class that is the mode of the individual trees’ predictions. It uses bootstrapping on the training set and random feature selection in the tree induction. This method can run efficiently on large datasets with thousands of features.
5 Experiments and Results
To classify infants’ facial expressions as pain or no-pain expressions, a total of 3026 face images were fed to the four previously mentioned CNN architectures to extract deep features. All CNNs are implemented in a MATLAB toolbox known as MatConvNet.
The extracted deep features were then divided into training (16 subjects, 1514 images/instances) and testing (15 subjects, 1512 images/instances) sets. For feature selection and classification, we applied feature selectors on the training set followed by machine learning classifiers. We experimented with the feature selection methods and the classifiers as implemented in Weka (version 3.7.13).
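Because the split is reported in terms of subjects, it appears to be subject-wise, so that no infant contributes images to both sets. A sketch of such a split, with hypothetical names and toy data:

```python
def split_by_subject(instances, train_subjects):
    """Split (subject_id, features, label) tuples so that every subject's
    instances end up entirely in the training set or in the test set."""
    train_ids = set(train_subjects)
    train = [inst for inst in instances if inst[0] in train_ids]
    test = [inst for inst in instances if inst[0] not in train_ids]
    return train, test


# Toy data: four subjects with a few frames each.
instances = [
    (1, [0.1, 0.2], "pain"),
    (1, [0.2, 0.1], "no pain"),
    (2, [0.4, 0.3], "pain"),
    (3, [0.9, 0.8], "no pain"),
    (3, [0.7, 0.6], "pain"),
    (4, [0.5, 0.5], "no pain"),
]
train, test = split_by_subject(instances, train_subjects=[1, 2])
train_ids = {inst[0] for inst in train}
test_ids = {inst[0] for inst in test}
```

Splitting by frame instead would let near-duplicate frames of the same infant leak into the test set and inflate the reported accuracy.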
We divide the experiments into three main parts: deep features from the higher layer, deep features from the lower layer, and merging deep features with traditional features. The reason for extracting features from both higher and lower layers is to investigate which layer is best to transfer. Then, we combined the deep features extracted using transfer learning with traditional features extracted as described in . We present each experiment and report its results next.
5.1 Higher Layer Deep Features
Higher layers (i.e., closer to the output) in CNNs contain high-level features that are specific to the dataset used for training. We hypothesize that the deep features extracted from the higher fully connected layer of CNNs trained for image classification (i.e., VGG-F,M,S) can be used for pain classification. In our experiment, we removed the output softmax layer (i.e., Full 8 in Tables 1-3) and extracted deep features from Full 7 before applying the Rectified Linear Unit function (PreReLU features, 4096 dimensions). We also extracted the features after they had been transformed by a ReLU function (PostReLU features, 4096 dimensions). The results of classifying pain using these features are summarized in the first three columns of Table 5 (NB, RFT, RF, and SU denote Naive Bayes, Random Forest Trees, Relief-f, and Symmetric Uncertainty; (#) indicates the number of selected features). As can be seen from the table, the highest accuracy for pain classification (90.41%) was obtained with the PostReLU features extracted from the VGG-S architecture. Although the PostReLU features of VGG-S have the highest accuracy, the difference between the AUC of VGG-S (0.742) and the AUCs of VGG-F,M is not statistically significant at the P=0.05 level.
In addition to VGG-F,M,S, we extracted deep features from the last fully connected layer of the VGG-Face CNN (i.e., Full 7 in Table 4) after removing the output layer. We extracted the features before and after applying ReLU (PreReLU features with 4096 dimensions and PostReLU features with 4096 dimensions). We hypothesize that the features extracted using this architecture should achieve better results than VGG-F,M,S since it was originally trained on a dataset relatively similar to our infants’ face dataset. The last column of Table 5 presents the pain classification results using the deep features of the VGG-Face CNN. Pain classification using deep features extracted from VGG-Face achieved 90.34% accuracy and 0.841 AUC. The AUC difference between VGG-Face (0.841) and VGG-S (0.742) is statistically significant at the P=0.05 level; the gray cells in Table 5 indicate this significant difference. This result is consistent with our hypothesis that VGG-Face would achieve better pain classification results.
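For reference, the two reported metrics can be computed as follows. This sketch uses the standard pairwise (Mann-Whitney) definition of AUC; it is not the Weka implementation, and it does not cover the significance test, which the text does not specify.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive instance is scored higher than a
    random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]            # 1 = pain, 0 = no pain
scores = [0.9, 0.4, 0.5, 0.1]    # classifier scores for the pain class
y_pred = [1 if s >= 0.5 else 0 for s in scores]
example_accuracy = accuracy(y_true, y_pred)
example_auc = auc(y_true, scores)
```

Accuracy depends on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why two models can have similar accuracy but different AUC.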
5.2 Lower Layer Deep Features
In contrast to the higher layers, whose features are customized to the dataset used to train the CNN, lower layers contain generic low-level features (e.g., edge detectors and colors) that are less specific to that dataset. Experimenting with lower layers’ features allows us to explore the usefulness of generic features for pain classification and compare their results with those of the higher layers.
In the case of VGG-F,M,S, the deep features were extracted from the last convolutional layer (i.e., Conv 5 in Tables 1-3) before and after ReLU (PreReLU features and PostReLU features). The feature vector dimensions are 43264, 86528, and 147968 for VGG-F, VGG-M, and VGG-S, respectively. See the first three columns in Table 6 for a summary of performance. As can be seen from the table, the highest accuracy (87.13%) was obtained with the PostReLU features extracted from the VGG-F architecture. Although the PostReLU features of VGG-F have the highest accuracy, the difference between the AUC of VGG-F (0.713) and the AUCs of VGG-M,S is not statistically significant at the P=0.05 level.
We also extracted deep features from the last convolutional layer (Conv 5 in Table 4) of VGG-Face. The extracted feature vectors have 100352 dimensions both before and after applying ReLU (see the last column in Table 6). Using the lower layer of VGG-Face for pain classification achieved 88.23% accuracy and 0.797 AUC. The AUC difference between VGG-Face (0.797) and VGG-F (0.713) is statistically significant at the P=0.05 level, as indicated by the gray cells in Table 6.
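The reported convolutional feature dimensions are consistent with flattening a square feature map of side s with c channels into a vector of length s x s x c. The spatial sizes and channel counts below are assumptions (they are not stated in the text); the check only shows that they reproduce the reported numbers.

```python
# Dimensions reported in the text for the last convolutional layer.
REPORTED = {
    "VGG-F": 43264,
    "VGG-M": 86528,
    "VGG-S": 147968,
    "VGG-Face": 100352,
}

# Assumed (spatial side, number of channels) of each Conv 5 output.
ASSUMED_SHAPE = {
    "VGG-F": (13, 256),
    "VGG-M": (13, 512),
    "VGG-S": (17, 512),
    "VGG-Face": (14, 512),
}


def flattened_size(side, channels):
    """Length of the vector obtained by flattening a side x side x channels map."""
    return side * side * channels


computed = {name: flattened_size(*shape) for name, shape in ASSUMED_SHAPE.items()}
```

These vectors are 10 to 36 times longer than the 4096-dimensional Full 7 features, which makes feature selection even more important for the lower-layer experiments.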
As we mentioned earlier, the reason for extracting features from higher and lower layers is to investigate which layers would give better pain classification results. Therefore, we compared the best result obtained from the higher layer of VGG-F,M,S (i.e., VGG-S, 90.41% accuracy and 0.742 AUC) with the best result obtained from the lower layer of these three CNNs (i.e., VGG-F, 87.13% accuracy and 0.713 AUC). The higher-layer accuracy is approx. 3.7% higher than the lower-layer accuracy. However, the AUC difference between them is not statistically significant at the P=0.05 level. Similarly, we compared the best result obtained from the higher layer of VGG-Face with the best result obtained from its lower layer. The former’s accuracy is approx. 2.3% higher than the latter’s, but the AUC difference between them is not statistically significant at the P=0.05 level.
5.3 Merging Deep and Traditional Features
In this experiment, we combined the top deep features of VGG-Face CNN architecture with traditional handcrafted features extracted using an optical-flow based method presented in .
The optical-flow based method works as follows. First, it calculates optical flow between consecutive frames of a video for the entire face region as well as for four regions (i.e., two upper regions and two lower regions). Then, it estimates the optical strain over the flow fields to generate the strain tensor components. Next, the strain magnitude is calculated for each region of the face along with the overall face region; each region generates a sequence (strain plot) corresponding to the amount of strain observed over time. Finally, the points of maximum strain are detected using a peak detector, and descriptive statistics for those peaks are calculated to generate the features. Using the strain features for pain classification gave 83.88% accuracy and 0.719 AUC.
Merging the deep features with the traditional strain features improves the pain classification performance. The best result (see Table 7, column 3) was obtained using a combination of five strain features and 10 PostReLU features extracted from the higher fully connected layer. This combination (Table 7, 4th column) showed a 9% increase in accuracy as compared to the accuracy of strain features (Table 7, 2nd column) and a statistically significant AUC difference (P=0.05).
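One simple way to realize such a combination is to concatenate the two feature vectors of each instance. The text does not spell out the merging mechanism, so concatenation and the row-by-row alignment of strain and deep features are assumptions of this sketch, and the numbers are placeholders.

```python
def merge_features(strain_rows, deep_rows):
    """Concatenate, instance by instance, handcrafted and deep features."""
    if len(strain_rows) != len(deep_rows):
        raise ValueError("both feature sets must describe the same instances")
    return [list(s) + list(d) for s, d in zip(strain_rows, deep_rows)]


# Placeholder values: 5 strain features and 10 selected deep features
# for each of two instances.
strain_rows = [[0.1] * 5, [0.2] * 5]
deep_rows = [[1.0] * 10, [2.0] * 10]
merged = merge_features(strain_rows, deep_rows)
```

Each merged instance then has 15 features and can be passed to the same classifiers as before.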
To summarize, we presented in this section three experiments for neonatal pain classification using transfer learning. In the first two experiments, we extracted deep features from the higher layer and the lower layer of four pre-trained CNN architectures. The higher-layer features showed higher pain classification accuracy, but the AUC difference was not statistically significant at the P=0.05 level. The best pain classification results were obtained using the VGG-Face architecture. This result is consistent with our hypothesis that VGG-Face would achieve better results than VGG-F,M,S since it was originally trained on a relatively similar dataset. In the last experiment, we combined deep features with traditional handcrafted features extracted as described in . Using mixed features for pain classification yielded the best result, with 92.71% accuracy and 0.948 AUC.
We conclude, based on these preliminary results, that transfer learning can be used to extract useful features for pain classification in neonates. We also conclude that combining both traditional and deep features is a good practice to improve the performance of pain expression classification and possibly the performance of similar tasks. Though, further investigation, on a larger dataset, is required to validate these findings.
6 Conclusion and Future Work
This paper proposes a novel pipeline for neonatal pain expression recognition using pre-trained CNNs as feature extractors. The extracted feature vectors were used to train several machine learning classifiers after applying feature selection to select the most relevant features. The best result (90.34% accuracy and 0.841 AUC) for pain expression recognition was obtained using deep features extracted from the last fully connected layer (Post-ReLU) after removing the output layer. This result is significantly higher (p=0.05) than that obtained using traditional handcrafted features (83.88% accuracy and 0.719 AUC). Combining both handcrafted and deep features yielded 92.71% accuracy and 0.948 AUC. These results suggest that transfer learning, which is a faster and more practical option than training a CNN from scratch, can be used to extract useful features for pain expression recognition in neonates. They also show that combining deep features with traditional handcrafted features is a good practice to improve the performance of pain expression recognition, and possibly the performance of similar applications.
As future work, we plan to fine-tune the weights of the pre-trained CNNs by continuing the backpropagation. We also plan to incorporate other pain indicators (e.g., crying sound) into facial expression to develop a deep multimodal pain assessment system.
References
-  M. Cruz, A. Fernandes, and C. Oliveira, “Epidemiology of painful procedures performed in neonates: a systematic review of observational studies,” European Journal of Pain, vol. 20, no. 4, pp. 489–498, 2016.
-  R. E. Grunau, L. Holsti, and J. W. Peters, “Long-term consequences of pain in human neonates,” in Seminars in Fetal and Neonatal Medicine, vol. 11, no. 4. Elsevier, 2006, pp. 268–275.
-  T. Field, “Preterm newborn pain research review,” Infant Behavior and Development, vol. 49, pp. 141–150, 2017.
-  J. G. Zwicker, S. P. Miller, R. E. Grunau, V. Chau, R. Brant, C. Studholme, M. Liu, A. Synnes, K. J. Poskitt, M. L. Stiver et al., “Smaller cerebellar growth and poorer neurodevelopmental outcomes in very preterm infants exposed to neonatal morphine,” The Journal of pediatrics, vol. 172, pp. 81–87, 2016.
-  E. W. Tam, V. Chau, D. M. Ferriero, A. J. Barkovich, K. J. Poskitt, C. Studholme, E. D.-Y. Fok, R. E. Grunau, D. V. Glidden, and S. P. Miller, “Preterm cerebellar growth impairment after postnatal exposure to glucocorticoids,” Science translational medicine, vol. 3, no. 105, pp. 105ra105–105ra105, 2011.
-  A. S. Butler, R. E. Behrman et al., Preterm birth: causes, consequences, and prevention. National Academies Press, 2007.
-  D. Hudson-Barr, B. Capper-Michel, S. Lambert, T. Mizell Palermo, K. Morbeto, and S. Lombardo, “Validation of the pain assessment in neonates (pain) scale with the neonatal infant pain scale (nips),” Neonatal Network, vol. 21, no. 6, pp. 15–21, 2002.
-  T. Voepel-Lewis, S. Merkel, A. R. Tait, A. Trzcinka, and S. Malviya, “The reliability and validity of the face, legs, activity, cry, consolability observational tool as a measure of pain in children with cognitive impairment,” Anesthesia & Analgesia, vol. 95, no. 5, pp. 1224–1229, 2002.
-  J. W. Peters, H. M. Koot, R. E. Grunau, J. de Boer, M. J. van Druenen, D. Tibboel, and H. J. Duivenvoorden, “Neonatal facial coding system for assessing postoperative pain in infants: item reduction is valid and feasible,” The Clinical journal of pain, vol. 19, no. 6, pp. 353–363, 2003.
-  D. Liu, F. Peng, A. Shea, R. Picard et al., “Deepfacelift: Interpretable personalized models for automatic estimation of self-reported pain,” arXiv preprint arXiv:1708.04670, 2017.
-  G. C. Littlewort, M. S. Bartlett, and K. Lee, “Automatic coding of facial expressions displayed during posed and genuine pain,” Image and Vision Computing, vol. 27, no. 12, pp. 1797–1803, 2009, Visual and multimodal analysis of human spontaneous behaviour. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0262885609000055
-  Z. Hammal and J. F. Cohn, “Automatic detection of pain intensity,” in Proceedings of the 14th ACM International Conference on Multimodal Interaction, ser. ICMI ’12. New York, NY, USA: ACM, 2012, pp. 47–52. [Online]. Available: http://doi.acm.org/10.1145/2388676.2388688
-  Z. Hammal and M. Kunz, “Pain monitoring: A dynamic and context-sensitive system,” Pattern Recognition, vol. 45, no. 4, pp. 1265–1280, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320311003931
-  M. M. Monwar and S. Rezaei, “Pain recognition using artificial neural network,” in 2006 IEEE International Symposium on Signal Processing and Information Technology, Aug 2006, pp. 28–33.
-  P. Lucey, J. Howlett, J. Cohn, S. Lucey, S. Sridharan, and Z. Ambadar, “Improving pain recognition through better utilisation of temporal information,” in International conference on auditory-visual speech processing, vol. 2008. NIH Public Access, 2008, p. 167.
-  A. B. Ashraf, S. Lucey, J. F. Cohn, T. Chen, Z. Ambadar, K. M. Prkachin, and P. E. Solomon, “The painful face – pain expression recognition using active appearance models,” Image and Vision Computing, vol. 27, no. 12, pp. 1788–1796, 2009, Visual and multimodal analysis of human spontaneous behaviour. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0262885609000985
-  P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and I. Matthews, “Painful data: The unbc-mcmaster shoulder pain expression archive database,” in Face and Gesture 2011, 2011, pp. 57–64.
-  K. Sikka, A. Dhall, and M. Bartlett, “Weakly supervised pain localization using multiple instance learning,” in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), April 2013, pp. 1–8.
-  S. Zhu, “Pain expression recognition based on plsa model,” The Scientific World Journal, vol. 2014, 2014.
-  R. Niese, A. Al-Hamadi, A. Panning, D. G. Brammen, U. Ebmeyer, and B. Michaelis, “Towards pain recognition in post-operative phases using 3d-based features from video and support vector machines.” International Journal of Digital Content Technology and its Applications-IJDCTA, vol. 3, no. 4, pp. 21–33, 2009.
-  M. Bartlett, G. Littlewort, M. Frank, and K. Lee, “Automatic decoding of facial movements reveals deceptive pain expressions,” Current Biology, vol. 24, no. 7, pp. 738–743, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S096098221400147X
-  K. Sikka, A. A. Ahmed, D. Diaz, M. S. Goodwin, K. D. Craig, M. S. Bartlett, and J. S. Huang, “Automated assessment of children’s postoperative pain using computer vision,” Pediatrics, vol. 136, no. 1, pp. e124–e131, 2015.
-  P. Werner, A. Al-Hamadi, and R. Niese, “Comparative learning applied to intensity rating of facial expressions of pain,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 28, no. 05, p. 1451008, 2014.
-  C. Florea, L. Florea, and C. Vertan, “Learning pain from emotion: Transferred hot data representation for pain intensity estimation.” in ECCV Workshops (3), 2014, pp. 778–790.
-  P. Werner, A. Al-Hamadi, R. Niese, S. Walter, S. Gruss, and H. C. Traue, “Towards pain monitoring: Facial expression, head pose, a new database, an automatic system and remaining challenges,” in Proceedings of the British Machine Vision Conference, 2013, pp. 119–1.
-  D. L. Martinez, O. Rudovic, D. Doughty, J. A. Subramony, and R. Picard, “(338) automatic detection of nociceptive stimuli and pain intensity from facial expressions,” The Journal of Pain, vol. 18, no. 4, p. S59, 2017.
-  N. Rathee and D. Ganotra, “Multiview distance metric learning on facial feature descriptors for automatic pain intensity detection,” Computer Vision and Image Understanding, vol. 147, pp. 77–86, 2016, Spontaneous Facial Behaviour Analysis. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1077314215002684
-  S. Brahnam, C.-F. Chuang, F. Y. Shih, and M. R. Slack, “Machine recognition and representation of neonatal facial displays of acute pain,” Artificial Intelligence in Medicine, vol. 36, no. 3, pp. 211–222, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0933365705000138
-  S. Brahnam, C.-F. Chuang, R. S. Sexton, and F. Y. Shih, “Machine assessment of neonatal facial expressions of acute pain,” Decision Support Systems, vol. 43, no. 4, pp. 1242–1254, 2007, Special Issue Clusters. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167923606000261
-  B. Gholami, W. M. Haddad, and A. R. Tannenbaum, “Relevance vector machine learning for neonate pain intensity assessment using digital imaging,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 6, pp. 1457–1466, June 2010.
-  L. Nanni, S. Brahnam, and A. Lumini, “A local approach based on a local binary patterns variant texture descriptor for classifying pain states,” Expert Systems with Applications, vol. 37, no. 12, pp. 7888–7894, 2010.
-  G. Zamzami, G. Ruiz, D. Goldgof, R. Kasturi, Y. Sun, and T. Ashmeade, “Pain assessment in infants: Towards spotting pain expression based on infants’ facial strain,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 5. IEEE, 2015, pp. 1–5.
-  P. Pal, A. N. Iyer, and R. E. Yantorno, “Emotion detection from infant facial expressions and cries,” in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 2, May 2006, pp. II–II.
-  C.-Y. Pai, “Automatic pain assessment from infants’ crying sounds,” 2016.
-  P. M. Faye, J. De Jonckheere, R. Logier, E. Kuissi, M. Jeanne, T. Rakza, and L. Storme, “Newborn infant pain assessment using heart rate variability analysis,” The Clinical journal of pain, vol. 26, no. 9, pp. 777–782, 2010.
-  S. Gruss, R. Treister, P. Werner, H. C. Traue, S. Crawcour, A. Andrade, and S. Walter, “Pain intensity recognition rates via biopotential feature patterns with support vector machines,” PloS one, vol. 10, no. 10, p. e0140330, 2015.
-  J. E. Brown, N. Chatterjee, J. Younger, and S. Mackey, “Towards a physiology-based measure of pain: patterns of human brain activity distinguish painful from non-painful thermal stimulation,” PloS one, vol. 6, no. 9, p. e24124, 2011.
-  M. Ranger and C. Gélinas, “Innovating in pain assessment of the critically ill: Exploring cerebral near-infrared spectroscopy as a bedside approach,” Pain Management Nursing, vol. 15, no. 2, pp. 519–529, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1524904212000422
-  H. D. Hadjistavropoulos, K. D. Craig, R. E. Grunau, and M. F. Whitfield, “Judging pain in infants: behavioural, contextual, and developmental determinants,” Pain, vol. 73, no. 3, pp. 319–324, 1997.
-  H. Oster, “Baby facs: Facial action coding system for infants and young children,” Unpublished monograph and coding manual. New York University, 2006.
-  S. Brahnam, L. Nanni, and R. Sexton, Introduction to Neonatal Facial Pain Detection Using Common and Advanced Face Classification Techniques. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 225–253. [Online]. Available: https://doi.org/10.1007/978-3-540-47527-9_9
-  G. Zamzmi, C.-Y. Pai, D. Goldgof, R. Kasturi, Y. Sun, and T. Ashmeade, “Machine-based multimodal pain assessment tool for infants: a review,” arXiv preprint arXiv:1607.00331, 2016.
-  L. A. Jeni, J. F. Cohn, and T. Kanade, “Dense 3d face alignment from 2d videos in real-time,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1. IEEE, 2015, pp. 1–8.
-  R. Paul, S. H. Hawkins, Y. Balagurunathan, M. B. Schabath, R. J. Gillies, L. O. Hall, and D. B. Goldgof, “Deep feature transfer learning in combination with traditional features predicts survival among patients with lung adenocarcinoma,” Tomography: a journal for imaging research, vol. 2, no. 4, p. 388, 2016.
-  H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler, “Deep learning for emotion recognition on small datasets using transfer learning,” in Proceedings of the 2015 ACM on international conference on multimodal interaction. ACM, 2015, pp. 443–449.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv preprint arXiv:1405.3531, 2014.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition.” in BMVC, vol. 1, no. 3, 2015, p. 6.
-  K. Kira and L. A. Rendell, “A practical approach to feature selection,” in Proceedings of the ninth international workshop on Machine learning, 1992, pp. 249–256.
-  M. A. Hall, “Correlation-based feature selection for machine learning,” 1999.
-  A. G. Rassadin, A. S. Gruzdev, and A. V. Savchenko, “Group-level emotion recognition using transfer learning from face identification,” arXiv preprint arXiv:1709.01688, 2017.
-  A. B. Sargano, X. Wang, P. Angelov, and Z. Habib, “Human action recognition using transfer learning with deep representations,” in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 463–469.
-  Y. Bar, I. Diamant, L. Wolf, and H. Greenspan, “Deep learning with non-medical training used for chest pathology identification,” in Proc. SPIE, vol. 9414, 2015, p. 94140V.
-  M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their applications, vol. 13, no. 4, pp. 18–28, 1998.
-  A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015, pp. 689–692.
-  G. Zamzmi, C.-Y. Pai, D. Goldgof, R. Kasturi, Y. Sun, and T. Ashmeade, “Automated pain assessment in neonates,” in Scandinavian Conference on Image Analysis. Springer, 2017, pp. 350–361.