Wasserstein Distance based Deep Adversarial Transfer Learning for Intelligent Fault Diagnosis
Abstract
The demand for artificial intelligence in condition-based maintenance has increased sharply over the past few years. Intelligent fault diagnosis is a critical part of maintenance solutions for mechanical systems. Deep learning models, such as convolutional neural networks (CNNs), have been successfully applied to fault diagnosis tasks for machinery systems and have achieved promising results. However, under the diverse working conditions found in industry, deep learning suffers from two difficulties: first, the well-defined (source domain) and new (target domain) datasets have different feature distributions; second, insufficient or no labelled data in the target domain significantly reduces the accuracy of fault diagnosis. Deep transfer learning (DTL) addresses these difficulties by performing learning in the target domain while leveraging information from a relevant source domain. Inspired by the Wasserstein distance of optimal transport, in this paper we propose a novel DTL approach to intelligent fault diagnosis, namely Wasserstein Distance based Deep Transfer Learning (WDDTL), which learns domain feature representations (generated by a CNN-based feature extractor) and minimizes the distance between the source and target distributions through adversarial training. The effectiveness of the proposed WDDTL is verified through 3 transfer scenarios and 16 transfer fault diagnosis experiments covering both unsupervised and supervised (with insufficient labeled data) learning. We also provide a comprehensive analysis of the network visualization of those transfer tasks.
I Introduction
Fault diagnosis aims to isolate faults in defective systems by monitoring and analyzing machine status using acquired measurements and other information, a task that requires experienced experts with a high skill set. This drives the demand for artificial intelligence techniques that can make fault diagnosis decisions. Deploying a real-time fault diagnosis framework allows the maintenance team to act in advance to replace or fix the affected components, thus improving production efficiency and guaranteeing operational safety.
Over the past decade, many advanced signal processing and machine learning techniques have been used for fault diagnosis. Signal processing techniques such as the wavelet transform [1] and the Hilbert-Huang transform [2] are adopted for feature extraction from faulty vibration signals, and machine learning models are then applied to automate the fault diagnosis procedure. In the last few years, deep learning models, such as deep belief networks (DBN) [3], sparse autoencoders [4], and especially convolutional neural networks (CNN) [5], have shown superior fitting and learning ability in fault diagnosis tasks over rule-based and model-based methods. However, these deep learning approaches suffer from two difficulties: 1) Most of the approaches work well only under a common hypothesis: the datasets for the source domain and target domain tasks are required to be identically distributed. Thus, the adaptability of a pretrained network is limited when facing a new diagnosis task, where different operational conditions and physical characteristics might cause a distribution difference between the new dataset (target dataset) and the original dataset (source dataset). As a result, for a new fault diagnosis task, the deep learning model is commonly rebuilt from scratch, which wastes computational resources and training time; 2) Insufficient labeled data, or only unlabeled data, in the target domain is another common problem. In real industrial situations, for a new diagnosis task, it is extremely difficult to collect enough typical samples to rebuild a large-scale and high-quality dataset to train a network.
Deep transfer learning (DTL) [6, 7] aims to perform learning in a target domain (with insufficient labeled data or only unlabeled data) by leveraging knowledge from relevant source domains (with sufficient labeled data), which saves much of the cost of rebuilding a new fault diagnosis model from scratch and recollecting sufficient labeled diagnosis samples. Many successful applications of DTL have been seen in various fields, including pattern recognition [8], image classification [9], and speech recognition [10].
Solutions to DTL can be roughly classified into three categories: instances-based DTL, network-based DTL, and mapping-based DTL. Instances-based DTL reweights or subsamples a group of instances from the source domain to match the distribution in the target domain. Network-based DTL reuses part of the network pretrained in the source domain, which is transferred to become part of the target network for a relevant new task; see [11, 12] for recent examples of instances-based and network-based DTL, respectively. However, these approaches are not capable of learning a latent representation from the deep architecture. Mapping-based DTL, compared with other approaches to adapting deep models, has shown excellent properties by finding a common latent space in which the feature representations of the source and target domains are invariant. Tzeng et al. [13] proposed a CNN-based network for domain adaptation, which introduces an adaptation layer to learn the feature representations. The maximum mean discrepancy (MMD) metric is used as an additional loss for the overall structure to compute the distribution distance with respect to a particular representation, which helps to select the depth and width of the architecture as well as to regularize the loss function during fine-tuning. Later, in [14] and [15], MMD was extended to the multiple kernel variant of MMD (MK-MMD) and joint MMD (JMMD) for better domain adaptation performance. However, the limitation of MMD for domain adaptation is that its computational cost grows quadratically with the number of samples when calculating the Integral Probability Metrics (IPMs) [16]. Recently, Arjovsky et al. [17] indicated that the Wasserstein distance can be a new direction for finding a better distribution mapping.
Compared with other popular probability distances and divergences, such as the Kullback-Leibler (KL) divergence and the Jensen-Shannon (JS) divergence, [17] demonstrated that the Wasserstein distance is a more sensible cost function when learning distributions supported by low-dimensional manifolds. Later on, [18] and [19] proposed a new gradient penalty term for the domain critic parameters to solve the vanishing or exploding gradient problems in [17]. Hence, the essence of our proposed approach is to adopt the Wasserstein distance to train a DTL model for the intelligent fault diagnosis problem, seeking to minimize the distance between the distributions of the source and target domains. Our motivation is to investigate how the Wasserstein distance behaves in transfer learning, given its excellent performance in generative adversarial networks (GAN).
This paper concerns the problem of DTL modeling to explore the transferable features of fault diagnosis under different operating conditions, including different motor speeds and different sensor locations. First, in the source domain, a base CNN model is trained with sufficient data. Then, we build a Wasserstein distance based DTL (WDDTL) to learn invariant features between the source and target domains. A neural network (denoted the domain critic) is introduced to estimate the empirical Wasserstein distance by maximizing the domain critic loss. After this procedure, a discriminator is introduced to optimize the parameters of the CNN-based feature extractor by minimizing the estimated empirical Wasserstein distance. Through this adversarial learning process, the transferable features from a source domain where fault labels are known can be used to diagnose a new but relevant diagnosis task without any labeled samples. To the best of our knowledge, this is the first work that adopts the Wasserstein distance with a CNN for measuring the domain distance in fault diagnosis problems. Experimental results on 16 transfer tasks demonstrate the effectiveness of the distance measurement method and the proposed DTL model. This paper makes the following contributions:

The Wasserstein distance is used to measure the distance between domains in fault diagnosis problems in order to find a better distribution mapping. The mapped features are extracted by a pretrained CNN-based feature extractor.

The proposed WDDTL framework can perform both unsupervised and supervised transfer tasks. Consequently, for a new diagnosis task, it can address both unlabeled and insufficiently labeled data in real industrial applications. Extensive experiments are conducted to support this statement.

The versatility of our WDDTL approach is demonstrated with transfer learning experiments covering 3 different transfer scenarios and 16 transfer tasks in total. Notably, the proposed WDDTL approach surpasses the existing transfer learning network DAN with MK-MMD in almost all transfer tasks.
This paper is organized as follows. Section II reviews related work, including CNNs for fault diagnosis and transfer learning. Section III presents our intelligent fault diagnosis framework based on transfer learning. Experimental results and comparisons are given in Section IV. Finally, conclusions and future work are given in Section V.
The following notations will be used throughout this work: the symbol $\mathbb{R}$ is the real number set, and the symbol $\mathbb{Z}^{+}$ is the positive integer set. $s$ and $t$ represent the source and target domain information respectively.
II Related works
This section reviews related work on intelligent fault diagnosis and CNN architectures, followed by a brief introduction to transfer learning and the Wasserstein distance.
II-A Convolutional Neural Networks
As the most well-known model in deep learning, the CNN has in recent years dominated recognition and detection problems in the computer vision domain. The initial CNN architecture was proposed by LeCun et al. in [20] and [21], and was inspired by Hubel and Wiesel's research on the cat's visual cortex [22]. The main characteristics of a CNN are local connections, shared weights, and local pooling [23]. The first two characteristics mean that a CNN requires fewer parameters than a multilayer perceptron to detect local information in visual patterns, while the last characteristic gives the network shift invariance. In this work, a 1D CNN is employed to solve the bearing fault diagnosis problem; 1D CNNs have been widely used with great success in speech recognition and document reading tasks.
In this work, a 1D CNN model, used as the base model, is pretrained in the source domain. The CNN extracts and learns the characteristics of the task by stacking a series of layers with repeated components, including convolutional layers (with an activation function), pooling layers, and fully connected layers (with an output classification layer) [24]. In a typical CNN architecture, a 1D input layer accepts the source domain signal, convolutional layers with rectified linear unit (ReLU) activation functions follow for feature extraction, max pooling layers downsample the data, and a fully connected layer combined with a softmax function is finally used for classification (with predefined labels). To minimize the loss function, model parameters are tuned using the backpropagation algorithm [25] with a stochastic gradient descent (SGD) optimizer until the predefined maximum number of iterations is reached. More details and expressions of each layer for the bearing fault diagnosis task are given in Section III.
II-B Transfer learning
Transfer learning can address the basic problem of unlabeled and insufficient data in the target domain under the diverse operating conditions of mechanical systems, by utilizing knowledge from the source domain to improve learning performance in the target domain. Some notations and definitions of transfer learning used in this work are first presented.
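To begin with, we define a domain and a task respectively. A domain in transfer learning is defined as $\mathcal{D} = \{\mathcal{X}, P(X)\}$, where $P(X)$ represents a marginal probability distribution of a feature space $\mathcal{X}$. Given predefined source and target domains $\mathcal{D}^s$ and $\mathcal{D}^t$, we have $\mathcal{D}^s \neq \mathcal{D}^t$. If $\mathcal{X}^s \neq \mathcal{X}^t$ and/or $P(X^s) \neq P(X^t)$, the two domains $\mathcal{D}^s$ and $\mathcal{D}^t$ have different distributions.
In the meantime, a task in transfer learning is defined as $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$, where $\mathcal{Y}$ represents a label space, $f(\cdot)$ is a predictive function, and $P(Y|X)$ is a conditional probability function. Since the classification categories are the same, the source and target domains have the same label space, $\mathcal{Y}^s = \mathcal{Y}^t$. Then, we give the definition of transfer learning.
Definition 1. (Transfer learning) Transfer learning aims to learn a prediction function $f^t(\cdot)$ for a learning task $\mathcal{T}^t$ by leveraging knowledge from the source domain $\mathcal{D}^s$ and source task $\mathcal{T}^s$, where $\mathcal{D}^s \neq \mathcal{D}^t$ or $\mathcal{T}^s \neq \mathcal{T}^t$. In most cases, $\mathcal{D}^s$ contains a much larger dataset than $\mathcal{D}^t$ (i.e., the cardinality of $\mathcal{D}^s$ is larger than that of $\mathcal{D}^t$).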
II-C Wasserstein distance
The Wasserstein distance was recently proposed in [17] as a training objective to tackle the training difficulty of generative adversarial networks (GAN), which arises from the discontinuity, with respect to the generator, of other distances and divergences such as the Total Variation (TV) distance and the Kullback-Leibler (KL) divergence. As a promising way to measure the distance between two distributions for GAN training, the Wasserstein distance can also be applied to DTL for domain adaptation.
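Given a compact metric set $\mathcal{M}$, let $\mathrm{Prob}(\mathcal{M})$ represent the space of probability measures on $\mathcal{M}$. The Wasserstein-1 distance (also called the Earth-Mover distance) between two distributions $\mathbb{P}_1, \mathbb{P}_2 \in \mathrm{Prob}(\mathcal{M})$ is defined as:
$$W_1(\mathbb{P}_1, \mathbb{P}_2) = \inf_{\gamma \in \Pi(\mathbb{P}_1, \mathbb{P}_2)} \mathbb{E}_{(x,y)\sim\gamma}\left[\|x - y\|\right] \tag{1}$$
where $\gamma(x,y)$ is a joint probability distribution and $\Pi(\mathbb{P}_1, \mathbb{P}_2)$ denotes the set of all joint distributions whose marginals are $\mathbb{P}_1$ and $\mathbb{P}_2$ respectively. The Wasserstein-1 distance can be viewed as an optimal transport problem: it aims to find an optimal transport plan $\gamma(x,y)$. Intuitively, $\gamma(x,y)$ indicates how much 'mass' must be transported from $x$ to $y$ in order to transform the distribution $\mathbb{P}_1$ into the distribution $\mathbb{P}_2$. Hence, the Wasserstein-1 distance is the cost of the optimal transport plan, i.e., the lowest achievable transport cost.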
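As a small illustration of the optimal transport view, the sketch below computes the empirical Wasserstein-1 distance for the special case of two one-dimensional samples of equal size, where the optimal plan simply pairs sorted values. The function name `wasserstein1_1d` and the toy samples are assumptions made for illustration; this closed form does not carry over to the high-dimensional CNN features used later, where the distance is estimated with a critic network instead.

```python
def wasserstein1_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension with equal sample sizes and uniform weights, the
    # optimal transport plan pairs the i-th smallest x with the i-th smallest y.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


source = [0.0, 1.0, 3.0]   # hypothetical source sample
target = [5.0, 6.0, 8.0]   # hypothetical target sample (source shifted by 5)
print(wasserstein1_1d(source, target))  # 5.0
```

Shifting every point by the same amount moves each unit of mass by exactly that amount, so the distance equals the shift.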
III Wasserstein Distance based Deep Transfer Learning (WDDTL)
Iiia Problem formulation
Since it is difficult to retrofit enough sensors in packaged equipment, and labeling in industry often requires expensive human effort, the challenge of domain adaptation for mechanical systems is that no or only limited labeled high-quality data can be collected in the target domain. For this reason, supervised domain adaptation by fine-tuning the pretrained architecture to fit the new classification problem in the target domain is not feasible. To solve this problem, many existing domain adaptation frameworks [16, 14] use MMD to learn invariant domain representations, minimizing the target loss through the source loss with an additional maximum mean discrepancy term. Our proposed approach, WDDTL, is a promising alternative for domain adaptation that uses the Wasserstein distance, which has been reported to have superior gradient properties compared with MMD [17], to minimize the distance between the source and target domain distributions. Although the Wasserstein distance with an MLP has been used in a few domain adaptation works on image classification tasks, to date there has been no attempt to adopt this technique in industry or manufacturing, and no attempt to enhance it with deep neural networks. It should also be noted that we propose to use the CNN architecture to generate the features on which the Wasserstein distance is measured in both domains; the excellent local feature detection ability of CNNs in manufacturing has been explored in [26]. The problem addressed in this work is formulated as follows:
DTL with domain adaptation for fault diagnosis is an unsupervised problem. We therefore first define a labeled source domain dataset $X^s = \{(x_i^s, y_i^s)\}_{i=1}^{n^s}$ with $n^s$ samples in the source domain $\mathcal{D}^s$. In the meantime, an unlabeled target domain dataset $X^t = \{x_j^t\}_{j=1}^{n^t}$ is defined in the target domain $\mathcal{D}^t$. In most cases, the source domain samples are sufficient to learn an accurate CNN classifier and are much more numerous than the target domain samples, which means $n^s \gg n^t$. Note also that data in the source and target domains share the same feature space ($\mathcal{X}^s = \mathcal{X}^t$) but have different marginal distributions ($P(X^s) \neq P(X^t)$).
The objective of this work is to construct a transferable framework, named WDDTL, for the target task that minimizes the target classification error with the help of knowledge from the source domain task and the use of the Wasserstein distance for domain adaptation.
III-B CNN based feature extractor
First of all, we propose to use a CNN to learn from the domain data. A CNN model is pretrained with the labeled source domain dataset $X^s$:
A convolution layer involves a filter $w$ and a bias $b$, which are applied to a window of $k$ input values (the filter size) to calculate a new feature. An output feature $c_i$ is obtained through the filter and a nonlinear activation function $\sigma$ with the following expression:
$$c_i = \sigma\left(w \ast x_{i:i+k-1} + b\right) \tag{2}$$
where $x_{i:i+k-1}$ is the input data representing the $i$th subvector of the source domain dataset $X^s$, and '$\ast$' denotes the convolution operation. A nonlinear activation function, such as the hyperbolic tangent (tanh) or the rectified linear unit (ReLU), is applied to each output; ReLU in particular reduces the risk of vanishing gradients, which may harm the convergence of the optimization. Hence, the feature map is defined as $c = [c_1, c_2, \dots, c_l]$, where $l$ is the number of features, which depends on the stride of the convolution.
A max pooling layer is then applied over the feature map to extract the maximum feature value within each window defined by the pooling filter size and stride. The idea is to capture the most salient features over disjoint regions. Because neighbouring features within a small window tend to be similar, keeping only the maximum preserves the important information while providing the shift invariance that is a key property of CNNs.
By stacking multiple layers as described above (with varying filter sizes), a multilayer structure is constructed for feature description. The output features of the multilayer structure are flattened and passed to fully-connected layers for classification, resulting in a probability distribution over the labels as the final output. For the pretrained CNN in the source domain, the softmax function [27] is selected for classification over the final feature map.
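The convolution and pooling steps described above can be sketched in a few lines of plain Python. This is a minimal single-filter illustration under stated assumptions: no padding, a sliding dot product (cross-correlation, as most deep learning libraries implement "convolution"), and ReLU as the activation. The names `conv1d_relu` and `max_pool1d` and the toy input are hypothetical and not taken from the paper.

```python
def conv1d_relu(x, w, b, stride=1):
    """Single-filter 1-D convolution without padding, followed by ReLU.

    Implemented as a sliding dot product (cross-correlation).
    """
    k = len(w)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        z = sum(w[j] * x[start + j] for j in range(k)) + b
        out.append(max(0.0, z))  # ReLU
    return out


def max_pool1d(c, size=2, stride=2):
    """Keep the maximum of each window of `size` values."""
    return [max(c[i:i + size]) for i in range(0, len(c) - size + 1, stride)]


x = [1.0, 2.0, -1.0, 0.0, 3.0, 1.0]          # toy input signal
feature_map = conv1d_relu(x, w=[1.0, -1.0], b=0.0, stride=1)
pooled = max_pool1d(feature_map, size=2, stride=2)
print(feature_map)  # [0.0, 3.0, 0.0, 0.0, 2.0]
print(pooled)       # [3.0, 0.0]
```

A real layer applies several such filters in parallel and stacks their feature maps as channels.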
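To compute the difference between the predicted label, $\hat{y}^s$, and the ground truth, $y^s$, in the source domain, the cross-entropy function is used to compute the loss:
$$L_c = -\frac{1}{n^s}\sum_{i=1}^{n^s}\sum_{j=1}^{C} \mathbb{1}\left(y_i^s = j\right)\log \hat{y}_{i,j}^s \tag{3}$$
where $n^s$ is the number of labeled source samples, $C$ is the number of classes, and $\hat{y}_{i,j}^s$ is the predicted probability of class $j$ for sample $i$.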
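For concreteness, the softmax output and the cross-entropy loss can be sketched as follows. This is an illustrative standard-library implementation, not the authors' TensorFlow code; the function names and the toy logits are assumptions, and labels are given as integer class indices.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = []
    for logits, y in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[y]))
    return sum(losses) / len(losses)


# Two hypothetical samples of a 4-way fault classification problem.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
labels = [0, 3]
print(round(cross_entropy(logits, labels), 4))
```

A classifier that outputs a uniform distribution over the four conditions incurs a loss of $\log 4$ per sample, which is a useful reference value when monitoring training.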
III-C Domain adaptation via Wasserstein distance
Features of the target domain, with unlabeled or insufficiently labeled data, can be obtained directly from the accurate pretrained CNN feature extractor of the previous subsection. The next problem is to resolve the distribution difference between the source and target datasets. To tackle this problem, we use the Wasserstein-1 distance to learn invariant feature representations in a common latent space between the two different feature distributions through adversarial training.
The part of the pretrained CNN before the fully-connected layers is used as the feature extractor to learn invariant feature representations from both domains. Consider two mini-batches of instances $\{x_i^s\}_{i=1}^{N}$ and $\{x_i^t\}_{i=1}^{N}$ drawn from $X^s$ and $X^t$, where $N \in \mathbb{Z}^{+}$ is the batch size. Both mini-batches are passed through a parametric function $f_g: \mathbb{R}^{m} \to \mathbb{R}^{d}$ (i.e., the feature extractor) with network parameters $\theta_g$, which directly generates the source features $h^s = f_g(x^s)$ and the target features $h^t = f_g(x^t)$. Let $\mathbb{P}_{h^s}$ and $\mathbb{P}_{h^t}$ be the distributions of $h^s$ and $h^t$ respectively.
The aim of domain adaptation via the Wasserstein distance [17] is to optimize the parameters $\theta_g$ to reduce the distance between the distributions $\mathbb{P}_{h^s}$ and $\mathbb{P}_{h^t}$. We introduce a domain critic that learns a function $f_w: \mathbb{R}^{d} \to \mathbb{R}$, with parameters $\theta_w$, which maps the source and target features to a real number. However, the infimum in Eq. (1) is highly intractable to handle directly. Thanks to the Kantorovich-Rubinstein duality [28], the Wasserstein-1 distance can be computed by
$$W_1(\mathbb{P}_{h^s}, \mathbb{P}_{h^t}) = \sup_{\|f_w\|_L \le 1} \mathbb{E}_{h \sim \mathbb{P}_{h^s}}\left[f_w(h)\right] - \mathbb{E}_{h \sim \mathbb{P}_{h^t}}\left[f_w(h)\right] \tag{4}$$
where the supremum is over all 1-Lipschitz functions $f_w: \mathbb{R}^{d} \to \mathbb{R}$. The empirical Wasserstein-1 distance can be approximately computed as follows:
$$L_{wd} = \frac{1}{N}\sum_{i=1}^{N} f_w\left(f_g(x_i^s)\right) - \frac{1}{N}\sum_{i=1}^{N} f_w\left(f_g(x_i^t)\right) \tag{5}$$
where $L_{wd}$ denotes the domain critic loss between the source data $X^s$ and the target data $X^t$.
The optimization problem is now to find the maximum of Eq. (5) while enforcing the Lipschitz constraint. Arjovsky et al. [17] proposed clipping the weights after each gradient update to force the parameters into a compact space. However, this method is time consuming when the clipping parameter is large and might result in vanishing gradients when the number of layers is too large. To solve this problem, [19, 18] suggest incorporating a gradient penalty $L_{grad}(h) = \left(\|\nabla_{h} f_w(h)\|_2 - 1\right)^2$ to train the domain critic with respect to the parameters $\theta_w$, where the feature representations $h$ consist of the generated source and target domain features (i.e., $h^s$ and $h^t$), as well as points randomly selected along the straight lines between pairs of $h^s$ and $h^t$.
Since the Wasserstein-1 distance is continuous and differentiable almost everywhere, we train the critic to optimality by solving the following optimization problem:
$$\max_{\theta_w}\left\{L_{wd} - \rho L_{grad}\right\} \tag{6}$$
where $\rho$ is the balancing coefficient.
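To make the critic objective concrete, the following sketch evaluates the empirical critic loss, a gradient penalty, and their penalized combination for a batch of features. It deliberately replaces the critic network with a linear function, a simplifying assumption made only so that the gradient with respect to a feature vector is available in closed form (it is just the weight vector); in practice the critic is a small neural network and the gradients come from automatic differentiation. All names, the toy features, and the default penalty weight of 10 (mirroring the value reported in the implementation details) are illustrative.

```python
import math
import random


def critic(h, w, b):
    """Linear critic: dot(w, h) + b (illustrative stand-in for the critic network)."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def critic_objective(hs, ht, w, b, rho=10.0, rng=random):
    """Return (critic loss, gradient penalty, critic loss - rho * penalty)."""
    l_wd = (sum(critic(h, w, b) for h in hs) / len(hs)
            - sum(critic(h, w, b) for h in ht) / len(ht))
    # Points sampled on straight lines between source/target feature pairs.
    interpolates = []
    for s, t in zip(hs, ht):
        eps = rng.random()
        interpolates.append([eps * si + (1 - eps) * ti for si, ti in zip(s, t)])
    # For a linear critic the gradient w.r.t. h equals w at every point, so
    # the penalty is the same for source, target, and interpolated features.
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    points = hs + ht + interpolates
    l_grad = sum((grad_norm - 1.0) ** 2 for _ in points) / len(points)
    return l_wd, l_grad, l_wd - rho * l_grad


hs = [[1.0, 0.0], [3.0, 0.0]]   # hypothetical source features
ht = [[0.0, 0.0], [0.0, 2.0]]   # hypothetical target features
print(critic_objective(hs, ht, w=[3.0, 4.0], b=0.0))  # (2.0, 16.0, -158.0)
```

With a weight vector of norm 5 the penalty dominates the objective, which is the mechanism that pushes the critic back toward a gradient norm of 1, i.e., toward the 1-Lipschitz constraint.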
III-D Classification with discriminator
Section III-C above describes unsupervised feature learning for domain adaptation, which may leave the learned feature representations in both domains insufficiently discriminative. As stated in Section III-A, our final objective is to develop an accurate classifier, WDDTL, for the target domain $\mathcal{D}^t$, which requires incorporating supervised learning on the labeled source domain data (and target domain data if available) into the invariant feature learning problem. A discriminator [29] (with two fully-connected layers) is therefore added to the representation learning approach to further reduce the distance between the source and target feature distributions. In this step, the parameters $\theta_w$ of the domain critic are the ones trained in Section III-C, while the parameters $\theta_g$ are modified to optimize the minimum operator.
The final objective function can now be expressed in terms of the cross-entropy loss $L_c$ of the discriminator according to Eq. (3) and the empirical Wasserstein distance $L_{wd}$ associated with the domain discrepancy, i.e.:
$$\min_{\theta_g, \theta_c}\left\{L_c + \lambda \max_{\theta_w}\left[L_{wd} - \rho L_{grad}\right]\right\} \tag{7}$$
where $\theta_c$ denotes the parameters of the discriminator, $\lambda$ is the hyperparameter that determines the extent of domain confusion, and $\rho$ weights the gradient penalty $L_{grad}$ on the domain critic. We omit the gradient penalty (i.e., set $\rho$ equal to 0) when optimizing the minimum operator, as it should not affect the representation learning process.
III-E WDDTL Approach
IV Experiments
IV-A Data description
To validate the effectiveness of the proposed DTL method for the fault diagnosis problem, we use a benchmark bearing fault dataset acquired by the Case Western Reserve University (CWRU) data centre. An experimental testbed (see Fig. 2) is used to collect the signals for detecting defects on bearings. Four types of bearing condition are inspected, namely health condition, fault on inner race, fault on outer race, and fault on roller, and all of them are sampled at a frequency of 12 kHz. Each fault type is tested at different levels of fault severity (0.007-inch, 0.014-inch, and 0.021-inch fault diameters). Each faulted bearing was installed in the test motor, which runs at four different motor speeds (i.e., 1797 rpm, 1772 rpm, 1750 rpm, and 1730 rpm). The vibration signal of each experiment was recorded for fault diagnosis.
Data preprocessing: Simple data preprocessing techniques are applied to the bearing datasets:

To treat the faulty signal as a stationary process, we divide the signals into samples of 2000 measurements each in both the source and target domains.

The fast Fourier transform (FFT) is used to compute the power spectrum of every sample in the frequency domain.

The left half of the power spectrum calculated by the FFT is kept as the input for WDDTL. Therefore, each input sample has 1000 measurements.
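The three preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: segments are non-overlapping, the power spectrum is the squared magnitude of the discrete Fourier transform without any normalization or windowing, and a naive DFT from the standard library stands in for an FFT routine (which computes the same values faster). The function names are hypothetical, and the demonstration uses a segment length of 8 instead of 2000 to keep the naive transform fast.

```python
import cmath
import math


def power_spectrum(segment):
    """Left half of the power spectrum of a real segment (naive O(n^2) DFT)."""
    n = len(segment)
    half = n // 2
    spectrum = []
    for k in range(half):
        acc = sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def preprocess(signal, segment_len=2000):
    """Split a signal into non-overlapping segments and keep the left half
    of each segment's power spectrum."""
    n_segments = len(signal) // segment_len
    return [power_spectrum(signal[i * segment_len:(i + 1) * segment_len])
            for i in range(n_segments)]


# Toy signal: a cosine completing 2 cycles per 8-sample segment.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
samples = preprocess(signal, segment_len=8)
print(len(samples), len(samples[0]))  # 2 4
```

With the segment length of 2000 used in this work, each sample yields 1000 spectral values, matching the input size stated above. Keeping only the left half loses no information for real-valued signals because their spectrum is symmetric.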
We propose three transfer scenarios, comprising two unsupervised scenarios and one supervised scenario (refer to Table I):
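Scenario     Learning type  Transfer task  Operating condition
US-Speed     Unsupervised   US(A)→US(B)    1797→1772 rpm
US-Speed     Unsupervised   US(A)→US(C)    1797→1750 rpm
US-Speed     Unsupervised   US(A)→US(D)    1797→1730 rpm
US-Speed     Unsupervised   US(B)→US(A)    1772→1797 rpm
US-Speed     Unsupervised   US(B)→US(C)    1772→1750 rpm
US-Speed     Unsupervised   US(B)→US(D)    1772→1730 rpm
US-Speed     Unsupervised   US(C)→US(A)    1750→1797 rpm
US-Speed     Unsupervised   US(C)→US(B)    1750→1772 rpm
US-Speed     Unsupervised   US(C)→US(D)    1750→1730 rpm
US-Speed     Unsupervised   US(D)→US(A)    1730→1797 rpm
US-Speed     Unsupervised   US(D)→US(B)    1730→1772 rpm
US-Speed     Unsupervised   US(D)→US(C)    1730→1750 rpm
US-Location  Unsupervised   US(E)→US(F)    Drive End→Fan End
US-Location  Unsupervised   US(F)→US(E)    Fan End→Drive End
S-Location   Supervised     S(E)→S(F)      Drive End→Fan End
S-Location   Supervised     S(F)→S(E)      Fan End→Drive End
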
Unsupervised transfer between motor speeds (US-Speed): For this scenario, we test the data sampled at 12 kHz at the drive end of the motor, and ignore the level of fault severity. Thus, we construct 4-way classification tasks (i.e., health condition, and three fault conditions with faults on inner race, outer race and roller) across 4 domains with different motor speeds: 1797 rpm (US(A)), 1772 rpm (US(B)), 1750 rpm (US(C)), and 1730 rpm (US(D)). In total, for this scenario, we evaluate our proposed method over 12 transfer tasks.

Unsupervised transfer between datasets at two sensor locations (US-Location): For this scenario, we focus on domain adaptation between different sensor locations and ignore both the level of fault severity and the differences in motor speed. Again, we construct 4-way classification tasks for the health condition and three fault conditions, across 2 domains (2 tasks) in which the vibration acceleration data are acquired by two sensors placed at the drive end (US(E)) and fan end (US(F)) of the motor housing respectively.

Supervised transfer between datasets at two sensor locations (S-Location): This scenario uses the same settings as the previous scenario US-Location, except that a small amount of labeled target domain data is added to the source domain, with the aim of enhancing the classification performance.
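The task counts in the three scenarios follow from taking every ordered pair of distinct domains within a scenario. The short sketch below enumerates them; the string format of the task names is chosen for illustration only.

```python
from itertools import permutations

speed_domains = ["A", "B", "C", "D"]   # 1797, 1772, 1750, 1730 rpm
location_domains = ["E", "F"]          # drive end, fan end

# Every ordered (source, target) pair of distinct domains is one transfer task.
us_speed = [f"US({s})->US({t})" for s, t in permutations(speed_domains, 2)]
us_location = [f"US({s})->US({t})" for s, t in permutations(location_domains, 2)]
s_location = [f"S({s})->S({t})" for s, t in permutations(location_domains, 2)]

print(len(us_speed), len(us_location), len(s_location))    # 12 2 2
print(len(us_speed) + len(us_location) + len(s_location))  # 16
```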
To evaluate the effectiveness of our proposed WDDTL approach on the bearing fault diagnosis problem, other approaches are also tested on the same dataset for comparison:

CNN (no transfer): This model is the pretrained network described in Section III-B, which is trained on the labeled source data and applied directly to the target domain to test the classification result.

DAN: We follow the idea in [14], which proposed a deep adaptation network (DAN) for learning transferable features via MK-MMD in deep neural networks. The MMD metric is an integral probability metric which measures the distance between two probability distributions by mapping the samples into a Reproducing Kernel Hilbert Space (RKHS). Domain adaptation via MMD has been explored for image classification in several works; see [30, 16, 14].

In addition, to evaluate the feature extraction ability of the CNN against conventional statistical features, results of traditional transfer learning methods using statistical (handcrafted) features [6], including transfer component analysis (TCA) [31], joint distribution adaptation (JDA) [32], and CORrelation ALignment (CORAL) [33], are also provided for comparison.
This work will mainly focus on the comparison between those deep transfer learning methods (DAN and WDDTL) and CNN.
IV-B Implementation details
TensorFlow [34] is used as the software framework for all deep learning experiments, and all models are trained with the Adam optimizer. We run each approach five times for 5000 iterations and record the best result of each run. We report the average and the 95% confidence interval of the classification accuracy for comparison. The sample sizes for the motor speed tasks (A), (B), (C), and (D) are 1026, 1145, 1390, and 1149 respectively. The sample sizes for the sensor location tasks (E) and (F) are 3790 and 4710 respectively. The batch size is fixed at 32 for all experiments.
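As a hedged illustration of how a mean and a 95% confidence interval can be obtained from five repeated runs, the sketch below uses the Student-t critical value for 4 degrees of freedom. The paper does not state which interval construction was used, so the t-based interval is an assumption, and the accuracies in the example are made up for demonstration.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs).
T_CRIT_95_DF4 = 2.776


def mean_and_ci95(accuracies):
    """Mean and 95% confidence half-width for exactly five repeated runs."""
    if len(accuracies) != 5:
        raise ValueError("the hard-coded critical value assumes five runs")
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, T_CRIT_95_DF4 * sem


runs = [96.1, 97.4, 98.0, 95.2, 99.0]  # hypothetical accuracies, not from the paper
mean, half_width = mean_and_ci95(runs)
print(f"{mean:.2f} (+/- {half_width:.2f})")
```

With only five runs the t-based interval is noticeably wider than a normal-approximation interval (critical value 1.96), which is worth keeping in mind when comparing overlapping intervals.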
CNN: Our CNN architecture comprises two convolutional layers (Conv1-Conv2), two max-pooling layers (Pool1-Pool2), and two fully-connected layers (FC1-FC2). The activation function in the output layer is softmax, while ReLU is used in the convolutional layers. The numbers of neurons in FC1 and FC2 are 128 and 4, respectively. The number of filters, kernel size, and stride of each layer are listed in Table II. Before transfer, we fine-tune the CNN models to achieve their best validation accuracies for all transfer scenarios.
Layer  Filters  Kernel size  Stride
Conv1  8        1x20         2
Pool1  -        1x2          2
Conv2  16       1x20         2
Pool2  -        1x2          2
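As a quick sanity check of the layer settings in Table II, the sketch below propagates the input length of 1000 through the four layers. It assumes no padding in any layer; the padding mode is not stated in the text, and with zero padding the resulting lengths would differ.

```python
def out_len(n, kernel, stride):
    """Output length of a 1-D convolution or pooling layer without padding."""
    return (n - kernel) // stride + 1


# (name, kernel size, stride) taken from Table II.
layers = [("Conv1", 20, 2), ("Pool1", 2, 2), ("Conv2", 20, 2), ("Pool2", 2, 2)]

n = 1000  # input length after preprocessing
for name, kernel, stride in layers:
    n = out_len(n, kernel, stride)
    print(name, n)

flattened = n * 16  # 16 filters in Conv2
print("Flattened feature size:", flattened)
```

Under this assumption the lengths are 491, 245, 113, and 56, so 896 values would be passed to FC1.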
DAN: The convolutional layers (Conv1-Conv2) of the CNN are used as the feature extractor. Then, to minimize the distance between the source and target domains, FC1 is used as the hidden layer for adaptation. The final representations of the hidden layer in both domains are embedded into an RKHS to reduce the MK-MMD distance. The final objective function is the combination of the MK-MMD loss and the classification loss. The best classification accuracies for the transfer scenarios are obtained by tuning the balancing coefficient of the discrepancy loss.
WDDTL: WDDTL method has been summarized in Fig. 1 and Algorithm 1. Similar to DAN, convolutional layers (Conv1Conv2) are used to extract features. The nodes of hidden layers in the domain critic network are set to 128 and 1, respectively. The training step is set to 10. The learning rates for the discriminator and the domain critic are and respectively. The gradient penalty is set to 10. Balance coefficient for optimizing the minimum operator is 0.1 and 0.8 for motor speed transfer and sensor location transfer, respectively.
For the traditional transfer learning methods TCA, JDA and CORAL, the regularization term is chosen from {0.001, 0.01, 0.1, 1.0, 10, 100}. An SVM is used in TCA and CORAL for classification.
IV-C Results and discussion
The results of the transfer tasks for WDDTL and the other approaches are shown in Table III. For the transfer tasks with an unlabeled dataset in the target domain (i.e., scenarios US-Speed and US-Location), we can observe that WDDTL outperforms CNN by a large margin, achieving increases of approximately 13.6% and 25% in average accuracy for the motor speed and sensor location transfer tasks, respectively. In addition, the WDDTL transfer accuracies are better than the DAN results in most tasks (an average increase of about 5%), the exception being transfer task US(D)→US(A), where the two accuracies differ by about 1%.
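The average gains quoted above can be checked directly from the "Average" rows of Table III. The snippet below only restates those table values and subtracts them; the dictionary layout is chosen for illustration.

```python
# Average accuracies (%) of the unsupervised scenarios, copied from Table III.
averages = {
    "US-Speed":    {"CNN": 82.10, "DAN": 90.42, "WDDTL": 95.75},
    "US-Location": {"CNN": 39.51, "DAN": 56.43, "WDDTL": 64.20},
}

for scenario, acc in averages.items():
    gain_cnn = acc["WDDTL"] - acc["CNN"]
    gain_dan = acc["WDDTL"] - acc["DAN"]
    print(f"{scenario}: +{gain_cnn:.2f} vs CNN, +{gain_dan:.2f} vs DAN")
```

The differences are 13.65 and 24.69 percentage points over CNN, and 5.33 and 7.77 percentage points over DAN, for the motor speed and sensor location scenarios respectively.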
To summarize the results, we can make the following observations: 1) WDDTL achieves the best transfer accuracies, with a 95.75% average score on the motor speed tasks, confirming the effectiveness of the Wasserstein distance in learning transferable features with a CNN-based model; 2) Even without domain adaptation, the CNN method already achieves good classification performance on the motor speed transfer tasks, owing to its excellent feature detection ability; 3) The accuracies of CNN, DAN and WDDTL on the transfer tasks of scenario US-Location are lower than on those of scenario US-Speed, because the signals obtained at different sensor locations (Fan End and Drive End) differ more than the signals obtained at different motor speeds; and 4) The proposed WDDTL approach shows a good ability to solve the supervised problem with a small amount of labeled data. Supervised transfer tasks S(E)→S(F) and S(F)→S(E) are carried out using only 0.5% of the sample size of the unsupervised case, but achieve performance as good as the unsupervised case, which uses 100% of the unlabeled samples. Further analysis of the effect of sample size for both supervised and unsupervised transfer learning is given in Section IV-D2.
Task           TCA    JDA           CORAL  CNN           DAN           WDDTL
US(A)→US(B)    26.55  65.07 (7.55)  59.18  82.75 (6.77)  92.97 (3.88)  97.52 (3.09)
US(A)→US(C)    46.80  51.31 (1.56)  62.14  78.65 (4.54)  85.32 (5.26)  94.43 (2.99)
US(A)→US(D)    26.57  57.70 (8.59)  49.83  82.99 (5.89)  89.39 (4.37)  95.05 (2.12)
US(B)→US(A)    26.63  71.19 (1.21)  53.57  84.14 (6.63)  94.43 (2.95)  96.80 (1.10)
US(B)→US(C)    26.60  69.80 (5.67)  57.28  85.41 (9.44)  90.43 (4.62)  99.69 (0.59)
US(B)→US(D)    26.57  88.50 (1.96)  60.53  86.09 (4.63)  87.37 (5.42)  95.51 (2.52)
US(C)→US(A)    26.63  56.42 (2.52)  54.03  76.50 (3.76)  89.88 (1.57)  92.16 (2.61)
US(C)→US(B)    26.66  69.18 (1.90)  76.66  82.75 (5.51)  92.93 (1.57)  96.03 (6.27)
US(C)→US(D)    46.75  77.45 (0.83)  70.34  87.04 (6.81)  90.66 (5.24)  97.56 (3.31)
US(D)→US(A)    46.74  61.72 (5.48)  59.78  79.23 (6.96)  90.88 (1.82)  89.82 (2.41)
US(D)→US(B)    46.79  74.03 (0.86)  59.73  79.73 (5.49)  87.91 (2.42)  95.16 (3.67)
US(D)→US(C)    26.60  65.24 (4.18)  63.02  80.64 (4.23)  92.94 (3.96)  99.62 (0.80)
Average        33.32  67.35 (3.53)  56.01  82.10 (5.89)  90.42 (3.59)  95.75 (2.62)
US(E)→US(F)    19.05  57.35 (0.47)  47.97  39.07 (2.22)  56.89 (2.73)  64.17 (7.16)
US(F)→US(E)    20.45  66.34 (4.47)  39.87  39.95 (3.84)  55.97 (3.17)  64.24 (3.87)
Average        19.75  61.85 (2.47)  43.92  39.51 (3.03)  56.43 (2.95)  64.20 (5.52)
S(E)→S(F)      20.43  65.48 (0.57)  51.77  54.04 (7.67)  59.68 (4.61)  65.69 (3.74)
S(F)→S(E)      19.02  59.07 (0.56)  47.88  50.47 (5.74)  58.78 (5.67)  64.15 (5.52)
Average        19.73  62.28 (0.57)  49.83  52.26 (6.71)  59.23 (5.14)  64.92 (4.63)
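The margins quoted in the discussion of Table III can be recomputed from its "Average" rows. The short Python sketch below does this arithmetic; the dictionary layout and the helper name `margin` are illustrative choices and not part of the original work.

```python
# Average accuracies (%) copied from the "Average" rows of Table III.
# The dictionary layout and names are illustrative assumptions.
averages = {
    "USSpeed":    {"CNN": 82.10, "DAN": 90.42, "WDDTL": 95.75},
    "USLocation": {"CNN": 39.51, "DAN": 56.43, "WDDTL": 64.20},
    "SLocation":  {"CNN": 52.26, "DAN": 59.23, "WDDTL": 64.92},
}


def margin(scenario, baseline, method="WDDTL"):
    """Accuracy gain of `method` over `baseline`, in percentage points."""
    row = averages[scenario]
    return round(row[method] - row[baseline], 2)


for scenario in averages:
    for baseline in ("CNN", "DAN"):
        gain = margin(scenario, baseline)
        print(f"{scenario:10s} WDDTL vs {baseline}: +{gain:.2f}")
```

For the two unsupervised scenarios, the gains over CNN come out at 13.65 and 24.69 percentage points, which matches the rounded figures of roughly 13.6% and 25% given in the text.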
IV-D Empirical Analysis
IV-D1 Feature visualization
To further evaluate the transfer performance of the proposed WDDTL framework, t-distributed stochastic neighbor embedding (t-SNE) is employed to perform nonlinear dimensionality reduction for network visualization. For comparison, the CNN and DAN transfer results for the same tasks are also presented.
For transfer tasks between motor speeds, i.e., scenario USSpeed, we randomly choose one task to visualize the learned feature representations under different motor speeds. Fig. 3 shows the comparison. The clusters in Fig. 3(c), formed by our proposed WDDTL, are better separated than both the CNN result in Fig. 3(a), which was not trained for domain adaptation, and the DAN domain adaptation result in Fig. 3(b). For example, in Fig. 3(a) with the CNN approach, three types of fault features show large overlapping areas, and some outer-race faults (yellow, label 2) fall into other fault types. Similarly, in Fig. 3(b) with the DAN approach, outer-race faults can hardly be separated from other fault types. With our WDDTL approach, the four conditions are clearly separated into different clusters. More importantly, the improvement in domain adaptation is evident, because source and target domain features are almost completely mixed within the same clusters.
For transfer tasks between different sensor locations, i.e., scenarios USLocation and SLocation, the t-SNE results of one transfer task are provided in Fig. 4. Although WDDTL shows better clustering than CNN and DAN, fault types 1, 2, and 3 are hard to separate clearly into individual clusters. It must be emphasized that the above results are obtained using 100% of the target-domain samples (4710), and even in this case the performance is not satisfactory. This raises the problem of how to enhance transfer learning performance when the signals in the source and target domains are relevant but not similar enough. We investigate this problem in the next subsection.
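As a rough illustration of this kind of visualization, the sketch below embeds source and target features jointly with t-SNE using scikit-learn. The features here are synthetic stand-ins generated with NumPy; in practice they would be the outputs of the trained feature extractor. The cluster layout, the `shift` parameter, and all variable names are assumptions made for the example, not details from the experiments.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-ins for learned features: 4 health conditions, 2 domains.
# Real inputs would be the feature extractor's outputs for source/target data.
n_per_class, n_features, n_classes = 20, 16, 4
centers = rng.normal(scale=5.0, size=(n_classes, n_features))


def fake_features(shift):
    """Draw clustered features; `shift` mimics a residual domain gap."""
    feats = [c + shift + rng.normal(size=(n_per_class, n_features))
             for c in centers]
    labels = np.repeat(np.arange(n_classes), n_per_class)
    return np.vstack(feats), labels


src_x, src_y = fake_features(shift=0.0)
tgt_x, tgt_y = fake_features(shift=0.5)

# Embed both domains jointly so that they share one 2-D map.
features = np.vstack([src_x, tgt_x])
embedding = TSNE(n_components=2, perplexity=20.0, init="pca",
                 random_state=0).fit_transform(features)

# Split the map back into source and target points for plotting.
src_2d, tgt_2d = embedding[: len(src_x)], embedding[len(src_x):]
print(src_2d.shape, tgt_2d.shape)
```

Plotting `src_2d` and `tgt_2d` with different markers, colored by `src_y` and `tgt_y`, gives a picture of the same type as Figs. 3 and 4: well-adapted features show source and target points of one condition falling into the same cluster.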
IV-D2 Effect of sample size on unsupervised and supervised accuracy
Next, we investigate the influence of data size on transfer task accuracy for our proposed method WDDTL. For each sample number tested, the same experiment is repeated five times and the transfer learning accuracies are recorded. Because WDDTL already achieves very good performance (95.75% average accuracy in Table III) for the unsupervised transfer scenario USSpeed, Fig. 5 displays the accuracy variation curves of WDDTL for tasks from scenarios USLocation and SLocation. The diagnosis accuracies saturate around a fixed value once the sample number exceeds 2500, so we only show results from 10 to 2500.
In Fig. 5(a), the accuracy of WDDTL increases from 59.47% and the final test accuracy settles around 64%. Across all sample numbers, the fault diagnosis accuracies of the WDDTL approach are higher than those of DAN and CNN. This analysis reveals that, for this unsupervised scenario, increasing the sample number can improve the transfer learning accuracy, but the improvement is limited (less than 5%) even when 100% of the target-domain samples are used. To address this problem, in Fig. 5(b) we employ a small amount of labeled data to improve the fault diagnosis accuracy, which corresponds to the case of limited labeled data in real industrial applications. The plot shows that when the labeled sample size is larger than 20 (out of 4710), the transfer learning accuracy of WDDTL surpasses that of the case in Fig. 5(a) with 100% sample size (blue zone in Fig. 5(a)). More specifically, using only 100 labeled samples (equivalent to 25 for each fault category) achieves 80% transfer learning accuracy, indicating that the proposed WDDTL is also an effective framework for supervised transfer tasks.
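To put the labeled sample sizes mentioned above in proportion, the following minimal Python sketch converts them into fractions of the 4710 target-domain samples and into per-class counts. It assumes four health conditions, as in the feature visualization discussion; the function name is illustrative.

```python
TOTAL_TARGET = 4710  # target-domain sample size reported in the text
N_CLASSES = 4        # four health conditions (assumed equal split per class)


def labeled_fraction(n_labeled, total=TOTAL_TARGET):
    """Share of the target-domain data that is labeled, in percent."""
    return 100.0 * n_labeled / total


for n in (20, 100):
    share = labeled_fraction(n)
    per_class = n / N_CLASSES
    print(f"{n} labeled samples = {share:.2f}% of target data, "
          f"{per_class:g} per class")
```

Both settings correspond to only a small share of the target data: 20 labeled samples are about 0.42% of it, and 100 labeled samples are about 2.12%.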
Based on the above discussion, we offer two recommendations for manufacturers using the proposed WDDTL approach: 1) for transfer tasks between similar signals in the source and target domains, such as transfer learning between different motor speeds, unsupervised transfer learning with unlabeled data is enough to obtain very good fault diagnosis accuracy (above 95%); and 2) for transfer tasks between signals that are relevant but not similar enough, such as transfer learning between different sensor locations, a small amount of labeled samples greatly improves the transfer learning accuracy compared with the unsupervised case using a large amount of unlabeled sample data.
IV-D3 Algorithm robustness evaluation
The robustness of our proposed algorithm WDDTL is investigated and compared with the CNN and DAN approaches. We run each task five times and record the transfer accuracy of each run. Fig. 6 illustrates the variation of transfer accuracy on the 12 tasks of the motor speed transfer scenario. Not only is the WDDTL accuracy higher than that of the other two approaches, but its 95% confidence interval is also narrower. This supports our motivation for using a CNN-based network and the Wasserstein distance for domain adaptation, since both the accuracy and the robustness of feature transferability are enhanced by the proposed algorithm.
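One common way to obtain a 95% confidence interval from five repeated runs is a Student-t interval around the mean. The text does not state how the interval in Fig. 6 was computed, so the sketch below is only an assumption about a reasonable procedure. The accuracies and the names `method_a` and `method_b` are hypothetical and are not results from the experiments.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs).
T_CRIT_DF4 = 2.776


def mean_ci95(accuracies):
    """Return (mean, half_width) of a 95% t-interval for exactly five runs."""
    if len(accuracies) != 5:
        raise ValueError("T_CRIT_DF4 is only valid for five runs")
    mean = statistics.mean(accuracies)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, T_CRIT_DF4 * sem


# Hypothetical accuracies (%) for one task; NOT values from the paper.
runs = {
    "method_a": [96.1, 97.0, 95.4, 96.8, 96.2],
    "method_b": [88.0, 93.5, 90.1, 85.7, 92.2],
}
for name, accs in runs.items():
    mean, half = mean_ci95(accs)
    print(f"{name}: {mean:.2f} +/- {half:.2f}")
```

A method with both a higher mean and a smaller half-width, like `method_a` in this toy data, corresponds to the pattern described for WDDTL: higher accuracy with a narrower interval.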
During our experiments, we also found that the robustness of transfer models for mechanical systems is worse than that of image classification transfer models, which might be due to the large noise in the acquired acceleration signals. In future work, this might be addressed by applying basic signal processing techniques to filter the noise.
V Conclusion
To achieve intelligent fault diagnosis, we proposed a novel Deep Transfer Learning architecture via Wasserstein Distance (WDDTL) to enhance the domain adaptation ability. WDDTL is built on a deep learning flow (CNN architecture) to extract features and introduces a domain critic to learn domain-invariant feature representations. Through an adversarial training process, WDDTL significantly reduces the domain discrepancy, thanks to the gradient property of the Wasserstein distance compared with other state-of-the-art distances and divergences. Our proposed method is tested on the CWRU benchmark bearing fault diagnosis dataset and compared with the base CNN model, the DAN metric and other traditional transfer learning methods over 16 transfer tasks. The performance on all transfer tasks demonstrates that WDDTL outperforms the other approaches with much better classification accuracies. Empirical results also show that 1) our proposed method achieves higher robustness for motor speed transfer tasks, and 2) WDDTL can help solve both the unlabeled and the insufficiently labeled data problems in real industrial applications. Future work includes investigating more transfer scenarios (e.g., transfer learning between different machines) for intelligent fault diagnosis and optimizing the architecture of our proposed algorithm.
References
 [1] S. J. Bae, B. M. Mun, W. Chang, and B. Vidakovic, “Condition monitoring of a steam turbine generator using wavelet spectrum based control chart,” Reliability Engineering & System Safety, 2017.
 [2] B. Samanta and K. AlBalushi, “Artificial neural network based fault diagnostics of rolling element bearings using timedomain features,” Mechanical Systems and Signal Processing, vol. 17, no. 2, pp. 317–328, 2003.
 [3] P. Tamilselvan and P. Wang, “Failure diagnosis using deep belief learning based health state classification,” Reliability Engineering & System Safety, vol. 115, pp. 124–135, 2013.
 [4] Z. Chen and W. Li, “Multisensor feature fusion for bearing fault diagnosis using sparse autoencoder and deep belief network,” IEEE Transactions on Instrumentation and Measurement, vol. 66, no. 7, pp. 1693–1702, 2017.
 [5] H. Hu, B. Tang, X. Gong, W. Wei, and H. Wang, “Intelligent fault diagnosis of the highspeed train with big data based on deep neural networks,” IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 2106–2116, 2017.
 [6] L. Guo, Y. Lei, S. Xing, T. Yan, and N. Li, “Deep convolutional transfer learning network: A new method for intelligent fault diagnosis of machines with unlabeled data,” IEEE Transactions on Industrial Electronics, 2018.
 [7] C. Sun, M. Ma, Z. Zhao, S. Tian, R. Yan, and X. Chen, “Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing,” IEEE Transactions on Industrial Informatics, 2018.
 [8] R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 999–1006.
 [9] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain adaptation: A survey of recent advances,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 53–69, 2015.
 [10] J. Deng, Z. Zhang, F. Eyben, and B. Schuller, “Autoencoderbased unsupervised domain adaptation for speech emotion recognition,” IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1068–1072, 2014.
 [11] Y. Yao and G. Doretto, “Boosting for transfer learning with multiple sources,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE conference on. IEEE, 2010, pp. 1855–1862.
 [12] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
 [13] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
 [14] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” in Proceedings of the 32nd International Conference on International Conference on Machine LearningVolume 37. JMLR. org, 2015, pp. 97–105.
 [15] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in International Conference on Machine Learning, 2017, pp. 2208–2217.
 [16] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel twosample test,” Journal of Machine Learning Research, vol. 13, no. Mar, pp. 723–773, 2012.
 [17] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
 [18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
 [19] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Adversarial representation learning for domain adaptation,” arXiv preprint arXiv:1707.01217, 2017.
 [20] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a backpropagation network,” in Advances in Neural Information Processing Systems, 1990, pp. 396–404.
 [21] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
 [22] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,” The Journal of Physiology, vol. 148, no. 3, pp. 574–591, 1959.
 [23] A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng, “On random weights and unsupervised feature learning,” in Proceedings of the 28th International Conference on International Conference on Machine Learning. Omnipress, 2011, pp. 1089–1096.
 [24] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [25] T. P. Vogl, J. Mangis, A. Rigler, W. Zink, and D. Alkon, “Accelerating the convergence of the backpropagation method,” Biological Cybernetics, vol. 59, no. 45, pp. 257–263, 1988.
 [26] C. Cheng, G. Ma, Y. Zhang, M. Sun, F. Teng, H. Ding, and Y. Yuan, “Online bearing remaining useful life prediction based on a novel degradation indicator and convolutional neural networks,” Submitted to IEEE Transactions on Mechatronics, 2018.
 [27] R. Girshick, “Fast rcnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
 [28] C. Villani, Optimal transport: old and new. Springer Science & Business Media, 2008, vol. 338.
 [29] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domainadversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
 [30] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, “Equivalence of distancebased and rkhsbased statistics in hypothesis testing,” The Annals of Statistics, pp. 2263–2291, 2013.
 [31] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
 [32] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer feature learning with joint distribution adaptation,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2200–2207.
 [33] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [34] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for largescale machine learning,” in Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation. USENIX Association, 2016, pp. 265–283.