Autonomously and Simultaneously Refining Deep Neural Network Parameters by Generative Adversarial Networks
The choice of parameters and the design of the network architecture are important factors affecting the performance of deep neural networks. However, there has not been much work on developing an established and systematic way of building the structure and choosing the parameters of a neural network, and this task heavily depends on trial and error and empirical results. Considering that there are many design and parameter choices, such as the number of neurons in each layer, the type of activation function, and whether to use dropout, it is very hard to cover every configuration and find the optimal structure. In this paper, we propose a novel and systematic method that autonomously and simultaneously optimizes multiple parameters of any given deep neural network by using a generative adversarial network (GAN). In our proposed approach, two different models compete and improve each other progressively with a GAN-based strategy. Our approach can be used to autonomously refine the parameters and improve the accuracy of different deep neural network architectures. Without loss of generality, the proposed method has been tested with three different neural network architectures on three very different datasets and applications. The results show that the presented approach can simultaneously and successfully optimize multiple neural network parameters, achieving increased accuracy in all three scenarios.
Keywords: Deep learning, neural networks, parameter choice, generative adversarial networks
Deep learning-based techniques have found widespread use in machine learning. Even before convolutional approaches became popular, simple multi-layer perceptron neural networks (MLPNNs) had been widely used for classification tasks for several reasons. First, they are very easy to construct and run, since each layer is represented and operated as a single matrix multiplication. Second, a neural network can become a complex non-linear mapping between input and output through the introduction of non-linear activation functions after each layer. Regardless of how large the input size (i.e., the number of features) or the output size (i.e., the number of classes) is, a neural network can discover relations between them when the network is sufficiently large and enough samples, covering the problem domain as much as possible, are provided during training.
Although an MLPNN can successfully form a complex non-linear relationship between the feature space and the output class space, it lacks the ability to discover features by itself. Until recent years, the classical approach was to provide, as the source of features, either the input data directly or some high-level descriptors extracted by applying an algorithm to the input data. Using the raw input as features does not guarantee any satisfactory mapping, and the latter approach requires investigating multiple hand-crafted descriptor extraction algorithms for different applications. The introduction of convolutional layers in neural networks removed the necessity of having prior feature extractors, which are not easy to craft for different applications. Convolutional layers are designed to extract features directly from the input. Since they have proven to be successful feature extractors, and thanks to the much faster computation of convolutional neural network (CNN) operations on specialized processors such as GPUs, the use of CNNs has exploded recently. After Krizhevsky et al. [1] achieved a significant increase in classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2] in 2012, many others followed their approach by creating different architectures and applying them to numerous applications in many different domains.
It is well known that the training of deep learning methods requires large amounts of data, and they usually perform better when the training data size is increased. However, for some applications it is not always possible to obtain more data when the dataset at hand is not large enough. In many cases, even though the raw data can be collected easily, the labeling or annotation of the data is difficult, expensive and time consuming. Successors of [1] yielded better accuracies with fewer parameters on the same benchmark through architectural modifications using the same building blocks. This shows that the choice of parameters and the design of the architecture are important factors affecting the performance. In fact, the design of a CNN model is very important to achieve better results, and many researchers have been working hard to find better CNN architectures [3, 4, 5, 6, 7, 8, 9] to achieve higher accuracy.
However, there has not been much work on developing an established and systematic way of building the structure of a neural network, and this task heavily depends on trial and error, empirical results, and the designer's experience. Considering that there are many design and parameter choices, such as the number of layers, the number of neurons in each layer, the number of filters at each layer, the type of activation function, and whether to use dropout, it is not possible to cover every possibility, and it is very hard to find the optimal structure. In fact, oftentimes some common settings are used without even trying different ones. Moreover, the hyper-parameters in the training phase also play an important role in how well the model will perform. Likewise, these parameters are also tuned manually, in an empirical way, most of the time.
In this work, we focus on optimizing the network architecture and training parameters for any given neural network model. We propose a novel and systematic way, which employs generative adversarial networks (GANs) to find the optimal structure and parameters.
1.1 Related Work
There have been works focusing on optimizing neural network architectures. Most of the proposed approaches are based on genetic algorithms (GAs), or evolutionary algorithms, which are heuristic search algorithms. Benardos and Vosniakos [10] proposed a methodology for determining the best neural network architecture based on the use of a genetic algorithm and a criterion that quantifies the performance and the complexity of a network. In their work, they focus on optimizing four architecture decisions: the number of layers, the number of neurons in each layer, the activation function in each layer, and the optimization function. Islam et al. [11] also employ a genetic algorithm for finding the optimal number of neurons in the input and hidden layers. They apply their approach to a power load prediction task and report better results than a manually designed neural network. However, their approach optimizes only the number of neurons in the input and hidden layers; optimization of other important design decisions, such as the number of layers or the type of activation function, is not discussed. Stanley and Miikkulainen [12] presented the NEAT algorithm for optimizing neural networks by evolving the topologies and weights of relatively small recurrent networks. In a recent work, Miikkulainen et al. [13] proposed the CoDeepNEAT algorithm for optimizing deep learning architectures through evolution, by extending existing neuroevolution methods to topology, components and hyperparameters. Ritchie et al. [14] proposed a method to automate the neural network architecture design process for a given dataset by using genetic programming. Genetic algorithm-based optimization uses a given set of blueprints and models, i.e., it performs a finite search over a discrete set of candidates. Thus, genetic algorithms in general cannot generate unseen configurations; they can only combine preset parameters.
Apart from genetic algorithms, Bergstra and Bengio [15] proposed random search for hyper-parameter optimization, and stated that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Yan and Zhang [16] optimized architectures' width and height under a growing running-time budget through submodularity and supermodularity.
Generative Adversarial Networks (GANs) [17] are one of the important milestones in deep learning research. In contrast to CNNs, which extract rich and dense representations of the source domain and may eventually map source instances into some classes, GANs generate instances of the source domain from small noise. They employ deconvolution operators, or transposed convolutions, to generate N-D instances from 1-D noise. A GAN's power comes from the competition with the discriminator, which decides whether a generated instance belongs to the source domain. The discriminator acts like police trying to intercept counterfeit money, where in this case the generator is the counterfeiter. The generator and discriminator are trained together until the discriminator cannot distinguish the generated instances from the instances in the source domain. GANs have been adapted in many applications [18, 19, 20, 21].
In our proposed approach, two different models compete and improve each other progressively with a GAN-based strategy. Our approach can be used to autonomously refine the parameters, and improve the accuracy, of different deep neural networks. For this work, we have tested the performance of our approach on three different neural network structures, covering Long Short-Term Memory (LSTM) networks and 3D CNNs, and different applications. Without loss of generality, we have chosen simpler network structures (not necessarily very deep ones) to optimize, in order to show that the performance improvement is obtained not by increasing the number of layers, but through better refinement and optimization of the network parameters.
2 Proposed Method
We propose a novel and systematic way, which employs generative adversarial networks (GANs), to find the optimal network structure and parameters. The proposed GAN-based network for refining different neural network parameters is shown in Fig. 1. It is composed of a generative part, an evaluation part and a discriminator. There are two generators (G1 and G2), two evaluators (E1 and E2), and one discriminator (D). The input to the two generators is Gaussian noise z. On the other hand, the input to the evaluators is the training data X.
As will be discussed in more detail below, the generators G1 and G2 have the same network structure. From the input noise z, G1 and G2 generate the network parameters p1 and p2 to be used and evaluated by E1 and E2, respectively. E1 and E2 have the structure of the neural network whose parameters are being optimized or refined. They calculate the classification accuracy on the training data X. acc_i represents the classification accuracy obtained by the evaluator E_i when the parameters p_i are used. The generator resulting in the higher accuracy is marked as the more accurate generator G_a, and the other generator is marked as G_b, where a, b ∈ {1, 2} and a ≠ b.
We define the discriminator D as a network used for binary classification between the better generator and the worse generator. p1 and p2 are fed into the discriminator D, and the ground-truth label about which is the better generator comes from the evaluators. The discriminator provides the gradients to train the worse-performing generator.
The details of the proposed method are described in Sec. 2.2, and the pseudo code is provided in Algorithm 1.
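Under the definitions above, one full iteration of the scheme can be sketched as follows. This is a high-level, illustrative skeleton only; the generator, evaluator, discriminator and training-step callables are hypothetical stand-ins supplied by the caller, not the authors' implementation.

```python
def refinement_step(g1, g2, evaluate, discriminator, train_worse, noise):
    """One iteration: generate parameters from noise, evaluate both
    candidate parameter sets, label the better generator "a", and train
    the worse generator via the discriminator's gradients."""
    p1, p2 = g1(noise), g2(noise)              # generated parameter vectors
    acc1, acc2 = evaluate(p1), evaluate(p2)    # evaluator accuracies
    a, b = (1, 2) if acc1 >= acc2 else (2, 1)  # "a" = better generator
    p_a, p_b = (p1, p2) if a == 1 else (p2, p1)
    train_worse(discriminator, p_a, p_b)       # update D and the worse G_b
    return a, max(acc1, acc2)
```

In a real run, `evaluate` would train the evaluator network with the given parameters on X and return its accuracy, and `train_worse` would perform the discriminator and generator updates described in Sec. 2.2.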
2.2 Details of the proposed network
2.2.1 Generative part:
The two generators G1 and G2 have the same neural network structure, shown in Fig. 3. Their input is a Gaussian noise vector z, and their outputs are p1 and p2. As seen in Fig. 3, the generators are composed of fully connected layers with leaky ReLU activations. At the output layer, tanh is employed so that each raw output o_i ∈ (−1, 1), where i = 1, …, n and n is the number of refined parameters. Then, the range of each o_i is changed from (−1, 1) to (v_min_i, v_max_i) by using

p_i = v_min_i + ((o_i + 1) / 2) (v_max_i − v_min_i).    (1)
In (1), v_max and v_min are preset maximum and minimum values, defined empirically based on the values that a certain parameter can take, so that each refined parameter can only vary between v_min and v_max. The re-scaled values p1 and p2 are then used as the parameters of the evaluator networks. The length n of p is determined by the number of network parameters that are refined, and is set at the generator network's last fully connected layer.
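The rescaling in (1) can be sketched in a few lines of NumPy. The vectors `v_max` and `v_min` below follow the ModelNet setting of Sec. 3.1; the function name is illustrative.

```python
import numpy as np

def rescale(o, v_min, v_max):
    """Map tanh outputs o in (-1, 1) to the range [v_min, v_max]
    elementwise, as in Eq. (1)."""
    o = np.asarray(o, dtype=float)
    return v_min + (o + 1.0) / 2.0 * (v_max - v_min)

# Example: a 9-dimensional generator output, using the preset ranges
# from the ModelNet experiment (Sec. 3.1).
v_max = np.array([4000, 4000, 4, 4, 4, 4, 4, 4, 1], dtype=float)
v_min = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
o = np.zeros(9)               # a tanh output of 0 maps to the mid-point
p = rescale(o, v_min, v_max)  # p[0] == 2000.5, p[2] == 2.0
```

A tanh output of −1 maps exactly to v_min and +1 to v_max, so the generator can reach any value within the preset range for each parameter.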
The generators are trained/improved by the discriminator, which is a binary classifier used to differentiate the generator outputs p1 and p2. Labels "a" and "b" represent the generators with the higher- and lower-accuracy results, respectively. The generator that has the worse performance, and is therefore labeled "b", is trained by stochastic gradient descent (SGD), with gradients from the discriminator, to minimize log(1 − D(G_b(z))) at each of the training epochs.
When acc1 remains equal to acc2 for two consecutive iterations, the weights of G_b will be re-initialized to default random values. The purpose of this step is to prevent the optimization from stopping at a local maximum, and also to prevent the vanishing tanh gradient problem.
2.2.2 Evaluation part:
As mentioned above, one of the strengths of the proposed approach is that it can be used to refine/optimize parameters of different deep neural network structures. In other words, the evaluator networks have the same structure as the neural network whose parameters are being optimized or refined. As will be shown in Sec. 3, we have tested the proposed approach with three different network structures, and different sets of parameters.
Evaluator networks are built by using the parameters p1 and p2 provided by the generators. The training data X is used to evaluate these network models. We employ an early stopping criterion: if no improvement is observed in c epochs, the training is stopped.
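The early-stopping rule, i.e. stopping once c consecutive epochs bring no improvement, can be sketched as a simple loop. The `train_one_epoch` and `evaluate` callables are hypothetical stand-ins for the evaluator's training and accuracy computation.

```python
def train_with_early_stopping(train_one_epoch, evaluate, c, max_epochs=100):
    """Train until no improvement in accuracy is seen for c consecutive
    epochs, and return the best accuracy observed."""
    best_acc, epochs_without_improvement = -1.0, 0
    for _ in range(max_epochs):
        train_one_epoch()
        acc = evaluate()
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= c:
                break  # stop: no improvement in the last c epochs
    return best_acc

# Toy example: the accuracy plateaus after the third epoch, so with
# c=2 training stops before the later 0.80 is ever reached.
accs = iter([0.60, 0.72, 0.75, 0.75, 0.74, 0.73, 0.80])
best = train_with_early_stopping(lambda: None, lambda: next(accs), c=2)
```

A smaller c stops training earlier and makes each evaluation cheaper, which matters here because the evaluators are retrained at every GAN iteration; Table 1 compares c = 1 and c = 5.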
We then obtain the accuracies acc_i, i ∈ {1, 2}. Let a be the value of i resulting in the higher accuracy, and b the other one. Then "a" is used as the ground-truth label for the discriminator, which marks the generator with the better parameters and trains the worse generator G_b.
We define the discriminator D as a network whose output is a scalar softmax output, used for binary classification between the better generator and the worse generator. p1 and p2 are fed into the discriminator D, and the ground-truth label about which is the better generator comes from the evaluators. Let D(p) represent the probability that p came from the more accurate generator G_a rather than G_b. We train D to maximize the probability of assigning the correct label to the outputs p_a and p_b of both generators. Moreover, we simultaneously train the worse generator G_b to minimize log(1 − D(G_b(z))). The whole process can be expressed by:

min_{G_b} max_D V(D, G_b) = E[log D(p_a)] + E_z[log(1 − D(G_b(z)))],

where p_a = G_a(z) and p_b = G_b(z).
The pseudo-code for the entire process is provided in Algorithm 1.
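The adversarial objective described above can be illustrated with a minimal logistic discriminator over parameter vectors. This is a toy NumPy sketch under strong simplifying assumptions, not the paper's network: D here is a single sigmoid unit, and only the discriminator's own gradient-ascent update is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_update(w, b, p_a, p_b, lr=0.1):
    """One gradient-ascent step on log D(p_a) + log(1 - D(p_b)),
    i.e. push D(p_a) toward 1 and D(p_b) toward 0."""
    for p, label in ((p_a, 1.0), (p_b, 0.0)):
        d = sigmoid(w @ p + b)
        grad = label - d          # log-likelihood gradient wrt the logit
        w = w + lr * grad * p
        b = b + lr * grad
    return w, b

# Toy parameter vectors standing in for the better/worse generator outputs.
w, b = np.zeros(3), 0.0
p_a, p_b = np.array([1.0, 0.5, 0.0]), np.array([0.0, 0.5, 1.0])
for _ in range(200):
    w, b = discriminator_update(w, b, p_a, p_b)
# After training, D scores the better generator's parameters higher:
# sigmoid(w @ p_a + b) > 0.5 > sigmoid(w @ p_b + b)
```

In the full method, the gradient of log(1 − D(p_b)) with respect to p_b is additionally backpropagated through the worse generator G_b, pulling its output toward the region the discriminator associates with the better generator.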
3 Experimental Results
In order to show the promise of the proposed approach in autonomously and simultaneously refining multiple deep neural network parameters, we tested its performance, without loss of generality, with three different neural network structures, and different training data types and applications. Below, we describe the details of each scenario. The network architectures, whose parameters are optimized, are shown in Figures 7, 7 and 7. In these figures, the parameters that are being refined/optimized are highlighted in red.
3.1 Experiments with ModelNet
We applied the proposed approach to a 3D convolutional network by using the ModelNet40 dataset [22]. ModelNet is a dataset of 3D point clouds, and the goal is to perform shape classification over 40 shape classes. Some example voxelized objects from the ModelNet40 dataset are shown in Fig. 4.
The 3D CNN model shown in Fig. 7 is used for the evaluators. The output of each generator is a 9-dimensional vector composed of different parameter settings. More specifically, two of the parameters are the numbers of neurons for two fully connected layers. Six of the parameters indicate the choice of activation function for the fully connected and convolutional layers from the ('Sigmoid', 'Relu', 'Linear', 'Tanh') functions. One of the nine parameters is a flag indicating whether to add a dropout layer between the fully connected layers. In this case, v_max and v_min are set to [4000, 4000, 4, 4, 4, 4, 4, 4, 1] and [1, 1, 0, 0, 0, 0, 0, 0, 0], respectively. Selecting the number of neurons is a regression problem, while choosing the activation function is a classification problem; in other words, for choosing the activation function, the output value is put into bins, and the corresponding function is selected.
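As a concrete illustration of how such a 9-dimensional rescaled output could be decoded into network settings, consider the sketch below. The binning of the activation entries into the four listed functions and the roles of the nine entries follow the text; the helper itself, including the rounding and thresholding choices, is a hypothetical reading, not the authors' exact code.

```python
ACTIVATIONS = ['Sigmoid', 'Relu', 'Linear', 'Tanh']

def decode_params(p):
    """Decode a rescaled 9-dim parameter vector:
    p[0:2] -> neuron counts (regression, rounded to integers),
    p[2:8] -> activation choices (values in [0, 4) binned into 4 classes),
    p[8]   -> dropout flag (thresholded at 0.5)."""
    neurons = [int(round(v)) for v in p[:2]]
    acts = [ACTIVATIONS[min(int(v), 3)] for v in p[2:8]]
    dropout = p[8] >= 0.5
    return neurons, acts, dropout

neurons, acts, dropout = decode_params(
    [2804.3, 2121.7, 0.2, 1.6, 3.9, 2.2, 1.1, 0.7, 0.8])
# neurons == [2804, 2122]; acts == ['Sigmoid', 'Relu', 'Tanh',
# 'Linear', 'Relu', 'Sigmoid']; dropout is True
```

Because the continuous activation entries are binned, the generator can smoothly explore the activation choices with the same tanh-and-rescale mechanism it uses for the neuron counts.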
The accuracy over the number of epochs is shown in Fig. 8. The blue and red lines show the accuracies for Generator 1 and Generator 2, respectively. The green line is the saved model with the refined parameters providing the best accuracy. The accuracies of the original network (start accuracy) and the proposed approach (end accuracy) are presented in Table 1 for different early stopping criteria, more specifically when c = 1 and c = 5. As can be seen, the proposed approach provides an increase in accuracy by autonomously and simultaneously refining nine parameters of the network in a systematic way.
| Number of epochs (c) for early stopping | 1 | 5 |
3.2 Experiments with UCI HAR Dataset and an LSTM-based network
The UCI HAR dataset [23] is composed of Inertial Measurement Unit (IMU) data captured during the activities of standing, sitting, laying, walking, walking upstairs and walking downstairs. These activities were performed by 30 subjects, and the 3-axial linear acceleration and 3-axial angular velocity were collected at a constant rate of 50 Hz.
In this case, the network model shown in Fig. 7, an LSTM model, is used for the evaluators. The output of each generator is a 9-dimensional vector composed of different parameter settings. More specifically, the first four parameters are the numbers of neurons for two fully connected layers and two LSTM layers. The next four parameters indicate the choice of activation function for the fully connected and the two LSTM layers from the ('Sigmoid', 'Relu', 'Linear', 'Tanh') functions. The last of the nine parameters is a flag indicating whether to add a dropout layer between the fully connected layers. In this case, v_max and v_min are set analogously, based on the range of values that each parameter can take.
The accuracy over the number of epochs is shown in Fig. 9. The blue and red lines show the accuracies for Generator 1 and Generator 2, respectively. The green line is the saved model with the refined parameters providing the best accuracy. The accuracies of the original network (baseline accuracy) and the proposed approach are presented in the second row of Table 2. As can be seen, the proposed approach provides an increase in accuracy for this LSTM network and this IMU dataset as well.
| Dataset | Baseline Accuracy | Accuracy of the Proposed Method |
| Words built from Chars74k | 85.5% | 86.64% |
3.3 Experiments with Chars74k Dataset
We also tested our proposed approach with a word recognition method [24], which uses the characters from the Chars74k dataset [25] to build words. The Chars74k dataset contains 64 classes (0-9, A-Z, a-z): 7705 characters obtained from natural images, 3410 hand-drawn characters captured on a tablet PC, and 62992 characters synthesised from computer fonts, giving a total of over 74K images. Some example words built from these characters are shown in Fig. 10.
The work in [24] uses the network model shown in Fig. 7 for character recognition, and then employs belief propagation for word recognition. In our experiments, we used the same network model in Fig. 7 for our evaluators, and then performed the word recognition in the same way to compare the word recognition accuracies. For the generators, the output is a 7-dimensional vector composed of different parameter settings. More specifically, the first two parameters are the numbers of neurons for two fully connected layers. The next four parameters indicate the choice of activation function for the fully connected and convolutional layers from the ('Sigmoid', 'Relu', 'Linear', 'Tanh') functions. The last of the seven parameters is a flag indicating whether to add a dropout layer between the fully connected layers. In this case, v_max and v_min are again set based on the range of values that each parameter can take.
The word recognition accuracies obtained by using the original network [24] (baseline accuracy) and the proposed approach are presented in the last row of Table 2. As can be seen, the proposed approach consistently provides an increase in accuracy for different types of networks and different datasets.
In Table 3, we present the parameters used in the original networks, and the parameters that were refined and optimized by the proposed method for all three different scenarios.
| Dataset | Baseline Parameters | Parameters Chosen by the Proposed Method |
| Words built from Chars74k | [1024, 256, 1, 1, 1, 1, 0] | [2804, 2121, 1, 1, 1, 1, 1] |
In this paper, we have presented a novel and systematic method that autonomously and simultaneously optimizes multiple parameters of any given deep neural network by using a GAN-based approach. The set of parameters can include the number of neurons, the type of activation function, the choice of using dropout, and so on. In our proposed approach, two different models compete and improve each other progressively with a GAN-based strategy. This approach can be used to refine the parameters of different network architectures. Without loss of generality, the proposed method has been tested with three different neural network architectures and three different datasets. The results show that the presented approach can simultaneously and successfully optimize multiple neural network parameters, achieving increased accuracy in all three scenarios.
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE (2009) 248–255
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European conference on computer vision, Springer (2014) 818–833
-  Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. In: CVPR (2015)
-  Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2818–2826
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. (2015) 91–99
-  Benardos, P., Vosniakos, G.C.: Optimizing feedforward artificial neural network architecture. Engineering Applications of Artificial Intelligence 20(3) (2007) 365–382
-  Islam, B.U., Baharudin, Z., Raza, M.Q., Nallagownden, P.: Optimization of neural network architecture using genetic algorithm for load forecasting. In: Intelligent and Advanced Systems (ICIAS), 2014 5th International Conference on, IEEE (2014) 1–6
-  Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evolutionary computation 10(2) (2002) 99–127
-  Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Navruzyan, A., Duffy, N., Hodjat, B.: Evolving deep neural networks. arXiv preprint arXiv:1703.00548 (2017)
-  Ritchie, M.D., White, B.C., Parker, J.S., Hahn, L.W., Moore, J.H.: Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC bioinformatics 4(1) (2003) 28
-  Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb) (2012) 281–305
-  Jin, J., Yan, Z., Fu, K., Jiang, N., Zhang, C.: Neural network architecture optimization through submodularity and supermodularity. arXiv preprint arXiv:1609.00074 (2016)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
-  Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015)
-  Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
-  Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint (2017)
-  Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593 (2017)
-  Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 1912–1920
-  Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L.: A public domain dataset for human activity recognition using smartphones. In: ESANN. (2013)
-  Li, Y., Li, Z., Qiu, Q.: Assisting fuzzy offline handwriting recognition using recurrent belief propagation. In: Computational Intelligence (SSCI), 2016 IEEE Symposium Series on, IEEE (2016) 1–8
-  de Campos, T.E., Babu, B.R., Varma, M.: Character recognition in natural images. In: Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal. (February 2009)