1 Introduction
In recent years, deep learning has been successfully applied to fault diagnosis and has become a research hotspot among data-driven methods hoang2019survey; wang2018deep; zhao2019deep. Deep learning has a strong ability to extract features, which makes it easy and effective to build an end-to-end fault diagnosis system khan2018review. Deep learning methods such as the Convolutional Neural Network (CNN) li2019understanding; abdeljaber20181; guo2018deep; li2020fault, Recurrent Neural Network (RNN) lei2019fault; liu2018fault, AutoEncoder (AE) shao2018novel; yu2019selective and Capsule Network (CN) chen2019deep; zhu2019convolutional have proven effective on multiple problems. However, although deep learning systems are powerful and easy to build, designing a neural network architecture requires extensive professional knowledge and debugging experience. Obtaining an optimal architecture takes many experiments, which makes the development of deep learning systems time-consuming. What we want is an automated machine learning (AutoML) system that can automatically design neural networks and tune hyperparameters.
Fortunately, as a branch of AutoML, neural architecture search (NAS) is developing rapidly and has become a new direction for deep learning elshawi2019automated; elsken2018neural; zoller2019survey. The general process of NAS is shown in Figure 1. Given a specific learning task and a search space, NAS automatically searches for the optimal neural architecture. Generally speaking, a search strategy selects an architecture from a predefined search space, and the selected architecture is then evaluated by an estimation strategy. Next, the search strategy is updated according to the evaluation results. This process is repeated until an optimal network architecture is obtained elsken2018neural.
In this paper, we propose a NAS method for fault diagnosis using reinforcement learning. An RNN is used as an agent (controller) to generate architectures, which are trained on the training dataset and evaluated on the validation dataset to obtain their accuracy. The accuracy serves as the reward for the controller, whose parameters are updated using a policy gradient algorithm. We also use parameter sharing to accelerate the search process. The proposed method is shown to be effective on the PHM 2009 Data Challenge gearbox dataset. Our contributions can be summarized as follows.

We apply NAS to fault diagnosis for the first time. A reinforcement-learning-based NAS framework is developed to search for the optimal architecture.

We put forward several problems and challenges in applying NAS and AutoML to fault diagnosis, and point out several directions for future research.
2 Related Work
Fault diagnosis using deep learning. Deep learning has been widely applied in fault diagnosis abdeljaber20181; guo2018deep; lei2019fault; li2019understanding; li2020fault; shao2018novel; yu2019selective, and some novel network structures have recently been proposed. zhu2019convolutional proposes a novel capsule network with an Inception block for fault diagnosis. First, signals are transformed into a time-frequency graph, and two convolution layers are applied to preliminarily extract features. Then an Inception block is applied to improve the nonlinearity of the capsule. After dynamic routing, the lengths of the capsules are used to classify the fault category. To obtain representations of signals at diverse resolutions in the frequency domain, huang2019improved proposes a new CNN structure named multi-scale cascade convolutional neural network (MCCNN), which uses filters of different scales to extract more useful information. To address the problem that proper feature selection requires expert knowledge and is time-consuming, pan2017liftingnet proposes a novel network named LiftingNet that learns features adaptively without prior knowledge. LiftingNet introduces a split layer, a predict layer and an update layer, and applies different kernel sizes to improve its learning ability.
Neural architecture search. The first influential work on NAS is zoph2016neural. In that paper, the authors use an RNN to generate descriptions of neural networks and train the RNN with reinforcement learning to maximize the expected accuracy of the generated networks on a validation dataset. The method generates not only CNNs but also Long Short-Term Memory (LSTM) cells. pham2018efficient proposes a fast and inexpensive method named Efficient Neural Architecture Search (ENAS), which shares parameters among child models and thereby greatly reduces search time compared with standard NAS. brock2017smash employs an auxiliary HyperNet to generate the weights of a main model with variable architectures, and develops a flexible scheme based on memory read-writes to define a diverse range of architectures. Unlike the above approaches, which search over a discrete and non-differentiable search space, liu2018darts proposes a differentiable architecture search method named DARTS, which relaxes the search space to be continuous so that architectures can be searched by gradient descent.
3 Methods
According to zoph2016neural, a neural network can typically be specified by a variable-length string, so it can be generated by an RNN. In this section, we use an RNN as a controller to generate a CNN with reinforcement learning. Given a search space, the CNN is designed by the RNN, and the RNN is trained with a policy gradient method to maximize the expected accuracy of the generated architectures.
3.1 Search Space
Our method searches for the optimal combination of convolution kernels within a fixed network structure. Several typical network structures can be used, such as the Inception, ResNet and DenseNet structures. In this paper, we search for the optimal architecture in a ResNet structure, which is shown in Figure 3. The inputs are first fed into a fixed stem layer, followed by several residual blocks in which the convolution kernel of each layer is generated by the RNN. Finally, a global average pooling layer flattens the feature maps and a classifier outputs the classification probabilities. Each residual block contains two layers and a skip connection. Here we use six different convolution kernels:

kernel with dilation rate

kernel with dilation rate

kernel with dilation rate

kernel with dilation rate

kernel with dilation rate

kernel with dilation rate
Dilated convolution injects holes into the standard convolution kernel to enlarge the receptive field yu2015multi. Compared with standard convolution, dilated convolution has one additional hyperparameter, the dilation rate, which specifies the spacing between adjacent kernel elements (a rate of 1 corresponds to standard convolution). An example of dilated convolution compared with standard convolution is shown in Figure 2.
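To make the effect of the dilation rate concrete, the following minimal sketch computes the span covered by a dilated kernel and applies it to a short signal. It assumes a one-dimensional signal and no padding; the function names `effective_kernel_size` and `dilated_conv1d` are illustrative and do not come from the paper's implementation.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input samples spanned by a dilated 1-D kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """1-D cross-correlation without padding, using a dilated kernel."""
    span = effective_kernel_size(len(kernel), dilation)
    output = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are placed `dilation` samples apart.
        output.append(
            sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        )
    return output


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7]
    k = [1, 0, -1]
    print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))
    print(dilated_conv1d(x, k, dilation=1))
    print(dilated_conv1d(x, k, dilation=2))
```

With the same three weights, a dilation rate of 2 covers five input samples instead of three, so the receptive field grows without adding parameters.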
In this paper, we use 4 blocks in the ResNet structure. With two searchable layers per block and 6 candidate convolution kernels for each layer, there are $6^{8} = 1{,}679{,}616$ possible architectures. Our aim is to find the optimal architecture in this large search space.
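The size of the search space can be checked with a few lines of arithmetic. This sketch assumes that each of the two layers in every residual block independently chooses one of the six kernels, and that the stem and downsampling layers are fixed; the function name is hypothetical.

```python
NUM_BLOCKS = 4        # residual blocks, as stated in the text
LAYERS_PER_BLOCK = 2  # searchable layers per residual block
NUM_KERNELS = 6       # candidate convolution kernels per layer


def search_space_size(num_blocks, layers_per_block, num_kernels):
    """Number of distinct kernel assignments over all searchable layers."""
    return num_kernels ** (num_blocks * layers_per_block)


if __name__ == "__main__":
    print(search_space_size(NUM_BLOCKS, LAYERS_PER_BLOCK, NUM_KERNELS))
```

Even for this small network, training every candidate to convergence would be impractical, which motivates a learned search strategy.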
3.2 Designing CNN using Recurrent Neural Network
Since a neural network can be encoded by a variable-length string, an RNN controller can be used to generate such a string. Here, the six convolution kernels are encoded as integers, so different combinations of integers represent different network architectures. In this paper, we use an LSTM to generate such combinations, as shown in Figure 3. At each step, the LSTM output is passed through a softmax to obtain a probability distribution over the six convolution kernels, and one kernel is sampled from this distribution. For example, for the first layer of the CNN, the controller may output a softmax distribution in which the fourth convolution kernel has the highest probability, 0.3, making it the most likely to be sampled. The sampled kernel becomes the convolution operation of the first CNN layer. Next, the embedding of the sampled integer is fed to the LSTM as input to generate the convolution kernel of the next layer, and so on until the convolution kernels of all layers have been generated.
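The layer-by-layer sampling described above can be sketched as follows. This is only an illustration: the LSTM is replaced by a hypothetical function `toy_logits` that maps the layer index and the previous choice to six logits, and the names `softmax` and `sample_architecture` are assumptions rather than the paper's code.

```python
import math
import random

NUM_KERNELS = 6  # candidate kernels per layer
NUM_LAYERS = 8   # 4 residual blocks with 2 layers each


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(layer, previous_choice):
    """Hypothetical stand-in for the LSTM: mildly favours repeating a kernel."""
    logits = [0.0] * NUM_KERNELS
    if previous_choice is not None:
        logits[previous_choice] = 1.0
    return logits


def sample_architecture(logits_fn, num_layers=NUM_LAYERS, rng=random):
    """Sample one kernel index per layer, feeding each choice to the next step."""
    architecture, log_prob, previous = [], 0.0, None
    for layer in range(num_layers):
        probs = softmax(logits_fn(layer, previous))
        choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        architecture.append(choice)
        log_prob += math.log(probs[choice])  # needed later for the policy gradient
        previous = choice
    return architecture, log_prob


if __name__ == "__main__":
    arch, log_p = sample_architecture(toy_logits, rng=random.Random(0))
    print(arch, log_p)
```

The accumulated log-probability is returned alongside the architecture because the policy gradient update needs it.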
3.3 Training With Reinforcement Learning
In reinforcement learning, there are two main parts: the agent and the environment. The agent learns a strategy from the rewards it obtains by interacting with the environment. In reinforcement-learning-based NAS, the agent is the RNN controller, the environment is the search space, and the reward is the validation accuracy of the sampled model. The CNN architecture generated by the controller is trained on the training dataset and then evaluated on the validation dataset to obtain the reward $R$. The controller is then updated using this reward. To find the optimal architecture, we maximize the expected reward of the controller:
$$J(\theta) = \mathbb{E}_{P(a_{1:T};\,\theta)}\left[R\right] \tag{1}$$
where $\theta$ denotes the parameters of the controller, $a_{1:T}$ is the list of convolution kernels sampled by the controller to generate a CNN, and $P(a_{1:T};\theta)$ is the probability that $a_{1:T}$ is sampled. Because the reward signal is not differentiable, we use the policy gradient algorithm to iteratively update $\theta$ williams1992simple:
$$\nabla_{\theta} J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{P(a_{1:T};\,\theta)}\left[\nabla_{\theta} \log P(a_t \mid a_{(t-1):1};\theta)\, R\right] \tag{2}$$
An empirical approximation of the above quantity is:
$$\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta} \log P(a_t \mid a_{(t-1):1};\theta)\, R_k \tag{3}$$
where $m$ is the number of different architectures that the controller samples in one batch, $T$ is the number of convolution kernels the controller has to predict for each CNN, and $R_k$ is the reward of the $k$-th sampled architecture. This update rule is an unbiased estimate but has very high variance. To reduce the variance, we subtract a baseline function $b$ in the update rule zoph2016neural:
$$\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta} \log P(a_t \mid a_{(t-1):1};\theta)\, (R_k - b) \tag{4}$$
where $b$ is an exponential moving average of the previous architecture validation accuracies.
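The baseline-corrected update of Eq. (4) can be illustrated with a small sketch. To keep it self-contained, the LSTM controller is replaced here by a table of independent per-layer logits, so each layer's probability does not depend on earlier choices; this is a simplification, not the paper's controller. The function names, the learning rate and the decay of 0.9 are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(theta, samples, baseline, lr=0.1):
    """Return updated logits after one gradient-ascent step.

    theta[t][j] is the logit of kernel j at layer t (simplified controller);
    samples is a list of (architecture, reward) pairs from one batch.
    """
    m = len(samples)
    grad = [[0.0] * len(row) for row in theta]
    for arch, reward in samples:
        advantage = reward - baseline  # reward minus baseline
        for t, a_t in enumerate(arch):
            probs = softmax(theta[t])
            for j, p in enumerate(probs):
                # Gradient of log softmax(theta[t])[a_t] w.r.t. theta[t][j]
                # is 1[j == a_t] - p_j.
                indicator = 1.0 if j == a_t else 0.0
                grad[t][j] += (indicator - p) * advantage / m
    return [
        [w + lr * g for w, g in zip(row, grad_row)]
        for row, grad_row in zip(theta, grad)
    ]


def update_baseline(baseline, rewards, decay=0.9):
    """Exponential moving average of the observed rewards."""
    for r in rewards:
        baseline = decay * baseline + (1.0 - decay) * r
    return baseline


if __name__ == "__main__":
    theta = [[0.0, 0.0]]                   # one layer, two kernels
    batch = [([0], 1.0), ([1], 0.0)]       # kernel 0 was rewarded, kernel 1 was not
    print(reinforce_step(theta, batch, baseline=0.5, lr=1.0))
```

In this toy batch the logit of the rewarded kernel rises and the other falls by the same amount, which is the intended effect of weighting the log-probability gradient by the reward relative to the baseline.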
3.4 Accelerating Training Using Parameter Sharing
Training a neural network from scratch is time-consuming. During the search, each sampled architecture would need to be trained from scratch to obtain its reward, which is very time-consuming and inefficient when the number of search epochs is large. To reduce the cost of searching, a weight sharing mechanism is applied during training. We do not train each sampled architecture from scratch; instead, we train the model on only one mini-batch of data, and the trained convolution kernels are reused in the next search epoch pham2018efficient. Many convolution operations are repeated among architectures, and weight sharing prevents them from being trained repeatedly. This greatly improves the efficiency of the search process.
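The bookkeeping behind weight sharing can be sketched as a table with one weight vector per (layer, kernel) pair, from which every sampled architecture picks its entries. This is a simplified illustration, not the paper's implementation: the class name `SharedKernelBank`, the flat weight vectors and the placeholder kernel sizes are all assumptions.

```python
class SharedKernelBank:
    """Holds one weight vector per (layer index, kernel index) pair."""

    def __init__(self, num_layers, kernel_sizes):
        self.weights = {
            (layer, k): [0.0] * size
            for layer in range(num_layers)
            for k, size in enumerate(kernel_sizes)
        }
        self.update_counts = {key: 0 for key in self.weights}

    def select(self, architecture):
        """Return the shared weights used by a sampled architecture."""
        return [self.weights[(layer, k)] for layer, k in enumerate(architecture)]

    def apply_gradients(self, architecture, grads, lr=0.001):
        """Update only the kernels that the sampled architecture actually uses."""
        for layer, (k, grad) in enumerate(zip(architecture, grads)):
            weights = self.weights[(layer, k)]
            for i, g in enumerate(grad):
                weights[i] -= lr * g
            self.update_counts[(layer, k)] += 1


if __name__ == "__main__":
    # Placeholder kernel sizes; the paper's actual sizes are not assumed here.
    bank = SharedKernelBank(num_layers=2, kernel_sizes=[3, 5])
    bank.apply_gradients([0, 1], [[1.0] * 3, [1.0] * 5], lr=0.5)
    # A later architecture that reuses kernel 0 in layer 0 sees the trained weights.
    print(bank.select([0, 0]))
```

Because two architectures that agree on a layer's kernel read and write the same entry, training one of them for a single mini-batch also advances the other.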
3.5 Neural Architecture Search Pipeline
In each search epoch, the RNN generates a number of architectures according to its output probability distribution. These architectures are each trained on a single mini-batch of training data, and their rewards are obtained on the validation data. The controller is then updated according to Eq. (4). This search process is repeated until the maximum number of search epochs is reached. Finally, architectures are generated by the trained controller, and the one with the highest validation accuracy is selected as the final architecture and trained from scratch. The whole process of neural architecture search for fault diagnosis is shown in Algorithm 1 and Algorithm 2.
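The control flow of the pipeline described above can be outlined in a short sketch. It is not a transcription of Algorithm 1 or Algorithm 2: the function `run_search`, its arguments and the toy `CyclingController` are hypothetical, and the training and reward functions are passed in as callables so that the loop stays independent of any particular model.

```python
def run_search(controller, train_shared_step, validation_reward,
               num_epochs, archs_per_epoch, num_final_candidates):
    """Alternate sampling, one-mini-batch training and controller updates."""
    for _ in range(num_epochs):
        batch = []
        for _ in range(archs_per_epoch):
            arch = controller.sample()
            train_shared_step(arch)  # one mini-batch on the shared weights
            batch.append((arch, validation_reward(arch)))
        controller.update(batch)     # policy gradient step on this batch
    # After the search: sample candidates and keep the best on validation data.
    candidates = [controller.sample() for _ in range(num_final_candidates)]
    return max(candidates, key=validation_reward)


class CyclingController:
    """Toy stand-in for the RNN controller: cycles through fixed architectures."""

    def __init__(self, architectures):
        self.architectures = architectures
        self.position = 0
        self.num_updates = 0

    def sample(self):
        arch = self.architectures[self.position % len(self.architectures)]
        self.position += 1
        return arch

    def update(self, batch):
        self.num_updates += 1  # a real controller would apply Eq. (4) here


if __name__ == "__main__":
    controller = CyclingController([(0, 0), (1, 2), (5, 5)])
    trained = []
    best = run_search(controller, trained.append, sum,
                      num_epochs=2, archs_per_epoch=2, num_final_candidates=3)
    print(best, controller.num_updates, len(trained))
```

The selected architecture would then be retrained from scratch, a step that lies outside this loop.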
4 Experiments
4.1 PHM2009 Dataset
In this paper, we use the gearbox dataset of the PHM2009 Data Challenge to study NAS for fault diagnosis. The data come from a typical industrial gearbox that contains 3 shafts, 4 gears and 6 bearings. Two sets of gears, spur gears and helical gears, are tested. There are six labels in this dataset. For each label, there are five shaft speeds (30, 35, 40, 45 and 50 Hz) and two loading modes (high and low). We do not distinguish between these working conditions under each label. The raw vibration signals are very long, so we segment them with a sliding window of length 1000 and step 50 to obtain training, validation and testing samples. The signals are then normalized. Finally, we obtain 22967 training samples, 2552 validation samples and 6380 testing samples. An example vibration signal for each of the six labels is shown in Figure 4.
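The segmentation step can be sketched as follows, using the window length of 1000 and step of 50 given above. The function names are illustrative, and the sketch covers only the slicing, not the normalization or the split into training, validation and testing sets.

```python
def sliding_windows(signal, window=1000, step=50):
    """Cut a long 1-D signal into overlapping fixed-length segments."""
    return [
        signal[start:start + window]
        for start in range(0, len(signal) - window + 1, step)
    ]


def num_windows(signal_length, window=1000, step=50):
    """Number of complete windows that fit into a signal of the given length."""
    if signal_length < window:
        return 0
    return (signal_length - window) // step + 1


if __name__ == "__main__":
    raw = list(range(1100))          # stand-in for a raw vibration signal
    segments = sliding_windows(raw)
    print(len(segments), num_windows(len(raw)), segments[1][0])
```

With a step of 50 and a window of 1000, consecutive samples overlap by 950 points, which is how a limited number of long recordings yields tens of thousands of samples.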
4.2 Training Details
For the controller, we set both the input size and the hidden size of the LSTM to 64 and the number of layers to 1. We use Stochastic Gradient Descent (SGD) with a learning rate of 0.01 to train the controller. In each search epoch, the controller is trained for a fixed number of steps, and in each training step a batch of architectures is sampled. To train the sampled architectures, we use Adam with a learning rate of 0.001 and L2 regularization. For the ResNet structure, we use 4 residual blocks, each containing 2 layers. Each layer is followed by a convolutional downsampling layer with a stride of 2. The number of channels in the first block is 8, and it doubles each time the features pass through a downsampling layer. The search runs for a fixed number of search epochs with a batch size of 128. After the search, a set of architectures is sampled from the controller and evaluated, and the architecture with the highest validation accuracy is selected as the final model. The final model is trained from scratch using Adam with a learning rate of 0.001 and a batch size of 128, and is then evaluated on the testing dataset. Throughout this process, the learning rate is adjusted using CosineAnnealingLR. The code is implemented in PyTorch 1.3 and run on a single Tesla K80 GPU. The whole search takes 1.6 hours.
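For readers unfamiliar with cosine annealing, the schedule can be written in closed form as below. This is a plain-Python sketch of the standard cosine annealing formula rather than a call to PyTorch's CosineAnnealingLR; the function name, the total number of steps and the minimum learning rate of 0 are assumptions, since the text does not state them.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under cosine annealing."""
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


if __name__ == "__main__":
    # Base learning rate 0.001 as used for Adam in the text; 100 steps is arbitrary.
    for step in (0, 25, 50, 75, 100):
        print(step, cosine_annealing_lr(step, total_steps=100, base_lr=0.001))
```

The rate starts at the base value, decays slowly at first, fastest around the midpoint, and flattens out again as it approaches the minimum.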
4.3 Results
Table 4.3 summarizes the results of NAS and six manually designed models. All layers of model M1 use the first convolution kernel in the search space, all layers of M2 use the second, and so on. Note that our method searches within the ResNet structure, so all compared models are variants of ResNet. The same approach could also be used to search within the Inception structure, the DenseNet structure and so on.
The searched architecture is shown in Figure 5, and it achieves an accuracy of 78.91%. The six manually designed models achieve an accuracy of at most 76.22%. This indicates that NAS based on reinforcement learning is effective: the controller obtains rewards from the sampled models and continually updates its parameters in the direction of producing better models. In addition, after each search epoch, 50 architectures were sampled to obtain their validation accuracies. Figure 6 shows the trend of these accuracies throughout the search process. The accuracies increase gradually, which indicates that reusing the convolution kernels is effective: the shared weights improve the performance and stability of the entire ResNet structure, so that the performance of sampled architectures can be evaluated accurately.
5 Discussions
In this paper, we have given an initial demonstration of NAS applied to fault diagnosis and shown its effectiveness. However, the application of NAS to fault diagnosis has only just started, and many challenges remain before the automatic design of deep learning models for fault diagnosis can be realized. We summarize the following open problems:

In this paper, we only search for the optimal architecture within the ResNet structure, which is a significant limitation. We are more interested in automatically designing novel and more complex structures that are not limited by existing structures or by the number of layers, and that achieve better performance. This is currently the most challenging problem.

Reinforcement-learning-based NAS is also a proxy NAS that costs a lot of time. In this paper, owing to the small dataset, the small network size and the small number of training epochs, the entire search took only 1.6 hours, of which training the controller accounts for a large part. How to use more efficient search methods for fault diagnosis is a problem that urgently needs to be solved.

In this paper, we only evaluate the testing accuracy of sampled architectures and do not consider the number of parameters of the searched architectures (which determines the storage space occupied by the model) or the amount of computation (which determines the speed of the model). When a model is deployed on an embedded terminal, the number of parameters and the amount of computation become very important. How to search for a model with few parameters and little computation but high accuracy is also a difficult problem for future research.

Beyond neural architecture search, automating machine learning for fault diagnosis is a wider and more difficult problem. Industrial equipment data are huge and complicated, preprocessing is difficult, and working conditions change. From data collection and preprocessing, through feature engineering and modeling, to model testing and parameter tuning, the entire development cycle is long and time-consuming. Automating machine learning in the field of PHM is an even more difficult challenge.
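The parameter and computation costs raised in the third item above can be estimated per layer with simple formulas. The sketch below assumes one-dimensional convolutions with a bias term and counts multiply-accumulate operations; the function names and the example sizes are illustrative and are not taken from the searched model.

```python
def conv1d_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of learnable parameters in a 1-D convolution layer."""
    weights = kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def conv1d_macs(kernel_size, in_channels, out_channels, output_length):
    """Multiply-accumulate operations for one forward pass of the layer."""
    return kernel_size * in_channels * out_channels * output_length


if __name__ == "__main__":
    # Illustrative sizes only: kernel of 3, 8 input and 16 output channels.
    print(conv1d_params(3, 8, 16))
    print(conv1d_macs(3, 8, 16, output_length=100))
```

Neither count depends on the dilation rate, so dilated kernels enlarge the receptive field at no extra parameter cost, whereas larger kernel sizes increase both counts proportionally. Quantities like these could be folded into the reward if the search were to account for model size.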
6 Conclusions
In this paper, we develop a neural architecture search method for fault diagnosis. This is the first time that NAS has been used to automatically generate deep learning models for fault diagnosis. We use an RNN as a controller to generate architectures in a ResNet search space and train the controller with reinforcement learning. Results show that NAS is effective at finding a model that performs better than manually designed models.
The work of Yang Hu is supported by the National Natural Science Foundation of China, grant number 61703431. The work of Mingtao Li is supported by the Youth Innovation Promotion Association CAS. The computing platform is provided by the STARNET cloud platform of the National Space Science Center Public Technology Service Center.