Meta-Learning with Hessian Free Approach in Deep Neural Nets Training
Meta-learning is a promising approach to training deep neural nets efficiently and has attracted increasing interest in recent years. However, most current methods are still unable to train complex neural net models over a long training process. In this paper, a novel second-order meta-optimizer, named Meta-Learning with Hessian-Free (MLHF) approach, is proposed, with the Hessian-free approach as its framework. Two recurrent neural networks are established to generate the damping and the preconditioned matrix of this Hessian-free framework. A series of techniques is used to stabilize and reinforce the meta-training of this optimizer, including a simplified gradient calculation and experience replay. Numerical experiments on deep convolutional neural nets, including CUDA-convnet and ResNet18(v2), with the cifar10 and ILSVRC2012 datasets, indicate that the MLHF shows good and continuous training performance over the whole long training process, i.e., both the rapidly decreasing early stage and the steadily decreasing later stage, and so is a promising meta-learning framework for improving training efficiency in real-world deep neural nets.
Boyu Chen Fudan University email@example.com Wenlian Lu Fudan University firstname.lastname@example.org
Preprint. Work in progress.
Meta-learning for optimizing neural networks, usually referred to as learning-to-learn, has attracted increasing interest from deep learning researchers in the last few years [3, 8, 38, 21, 22, 30, 36, 10]. In contrast to hand-crafted optimizers, for instance Stochastic Gradient Descent (SGD) and its variants, including Adam and RMSprop, meta-learning employs a trained meta-optimizer, usually realized by recurrent neural networks (RNN), to infer the descent directions used to train the underlying neural network, with the aim of better learning performance. This methodology is promising because it is widely believed that a neural network can “learn” a “more effective” descent direction than the existing ones.
A meta-learning method generally has two components. One is a well-defined neural network that outputs the “learned” descent direction and can be heuristic; the other is a decomposition mechanism, also known as a framework, which greatly reduces the number of meta-parameters of the meta-optimizer and enhances its generality, i.e., the trained meta-optimizer can work for at least one type of neural net learning task. The major frameworks of the last few years include the coordinatewise framework and the hierarchical framework via RNN. However, most current meta-learning methods only work for simple back-propagation (BP) models with a short training process, because they become unstable when training a large-scale deep neural net. Hence, developing an efficient meta-optimizer and a good framework that is stable at an acceptable computing cost is still a challenge for applying meta-learning to practical deep networks.
In this paper, we propose a novel second-order meta-optimizer, which utilizes the Hessian-free method as its framework. Specifically, the contributions and novelty of this paper include:
- We realize the well-known Hessian-free method in meta-learning;
- We improve the learning-to-learn losses of the recurrent neural networks of the meta-optimizer and utilize an experience replay process in the meta-training.
Meta-learning has a history as long as the development of neural nets itself. The early exploration was done by Schmidhuber in the 1980s. Afterwards, based on this idea, many works proposed diverse learning algorithms, for instance [34, 29, 16]. At the same time, Bengio et al. [7, 5, 6] introduced learning locally parameterized rules instead of back-propagation. In very recent years, the coordinatewise RNN framework proposed by Andrychowicz et al. pointed to a promising direction in which a meta-learned optimizer can be applied to diverse neural network architectures.
The power of the coordinatewise RNN framework inspired further development of meta-learning. Andrychowicz et al. also employed the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm, with the inverse of the Hessian matrix regarded as the memory and a coordinatewise RNN as the controller of a Neural Turing Machine. However, storing the inverse of the Hessian matrix requires an amount of memory that is infeasible for a large-scale neural net. Li and Malik proposed a similar approach at the same time, but with the RNN in the meta-optimizer trained by reinforcement learning. Ravi and Larochelle adapted the method of  to few-shot learning tasks by using the test error to train the meta-optimizer. Chen et al. utilized an RNN to output the query points of Bayesian optimization to train the neural net, instead of outputting descent directions. Finn et al. proposed the Model-Agnostic Meta-Learning method, which introduces a new parameter initialization strategy to enhance the generalization of meta-learning.
In contrast, Wichrowska et al. addressed the problems of stabilization and generalization of . They proposed a hierarchical architecture framework rather than the coordinatewise framework. For the first time, a learning-to-learn method was applied to train large-scale deep neural nets such as Inception v3 and ResNet v2 on the large ILSVRC2012 dataset. However, the performance was not ideal.
We consider a neural net formulated by , where stands for the input training data, for all parameters and stands for the output of the neural net. Let be the labels of the training data. The learning process of the neural net is to minimize a certain loss denoted by . We also denote by when there is no ambiguity.
2.1 Natural Gradient
The gradient descent direction can be regarded as the direction in the tangent space of the parameters that decreases the loss function the most. The well-known first-order gradient is the steepest direction with respect to the Euclidean metric and is the basis of most gradient descent algorithms in practice, for instance SGD, Adam and others involving momentum.
However, as argued by Amari, the Euclidean metric on the parameters' tangent space in fact assumes that all parameters have the same weight in the metric and does not take the characteristics of the neural net into consideration. In addition, this metric is not invariant under reparameterization [23, 2]. To overcome this issue, the natural gradient for neural networks was developed by Amari. One general definition is
where the metric is defined as . Assuming (1). , for all ; (2). , for all and ; (3). is differentiable with respect to and which is true for the mean square loss and the cross-entropy loss, the metric has the following expansion
where is the Jacobian matrix of with respect to and is the Hessian matrix of with respect to when . Hence, the natural gradient is specified as
where and is the normalization scalar. More specifically, if is the cross-entropy loss, then is the Fisher information matrix, which agrees with the original definition in Amari.
In many applications, the natural gradient performs much better than gradient descent. However, calculating the natural gradient in deep neural nets is difficult in practice, because calculating on a small mini-batch of the training data always makes of low rank, so that does not exist. One alternative is to use the damping technique [26, 20]: let , where is a positive scalar. However, selecting a proper value for is difficult: if is too large, the natural gradient degenerates to the weighted gradient; if is too small, the natural gradient could be too aggressive due to the low rank of on a mini-batch of the training data.
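The rank problem and the effect of damping can be seen on a toy example. The following Python sketch is only an illustration under stated assumptions: it uses a random per-sample Jacobian `J`, takes the output-space Hessian to be the identity, and the names `damped_curvature` and `lam` are hypothetical rather than part of the method described here.

```python
import numpy as np

def damped_curvature(J, lam):
    """Return J^T J / batch + lam * I for a Jacobian J of shape (batch, n_params).

    Assumption: the Hessian of the loss w.r.t. the outputs is the identity,
    so the curvature matrix reduces to J^T J averaged over the mini-batch.
    """
    batch, n_params = J.shape
    return J.T @ J / batch + lam * np.eye(n_params)

rng = np.random.default_rng(0)
J = rng.standard_normal((4, 10))          # 4 samples, 10 parameters
grad = rng.standard_normal(10)

G = damped_curvature(J, lam=0.0)          # rank-deficient: batch < n_params
G_damped = damped_curvature(J, lam=1e-2)  # damping makes it invertible

rank_undamped = np.linalg.matrix_rank(G)
rank_damped = np.linalg.matrix_rank(G_damped)

# With damping, the linear system for the descent direction is solvable.
direction = np.linalg.solve(G_damped, grad)

# With very large damping, the direction approaches grad / lam.
big_lam = 1e6
direction_big = np.linalg.solve(damped_curvature(J, big_lam), grad)
```

With 4 samples and 10 parameters the undamped matrix has rank at most 4, while any positive `lam` restores full rank; a very large `lam` makes the direction approach `grad / lam`, i.e. a rescaled plain gradient.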
2.2 Hessian Free Method in Neural Nets
Due to the arguments above, to avoid calculating directly, the Hessian-free method was proposed by Martens and by Martens and Sutskever to calculate the natural gradient or other second-order descent directions in practical deep neural nets. The key idea of the Hessian-free method is twofold: calculating and calculating .
First, to calculate , we are to calculate (1). , (2). , and then (3). . In a multi-layer neural net, is computed iteratively in a forward pass. At layer , let be the map of layer with and the parameters and input of layer , and be the output of layer . We have the following iterative formula:
noting , where be the partial of associated with in , and in a BP layer, zero at the input layer, and also possibly has other formulas in other types of layer, for instance, the residual layer. Iteration of Equation (3) until the last output layer, i.e., , gives . In addition, is easy when is of low rank and is a backward process. Also, this approach can be applied to , where
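The three-step product can be illustrated on a model small enough that the curvature matrix can also be built explicitly for comparison. This sketch assumes a softmax regression model with logits `X @ W.T` and cross-entropy loss, for which the Hessian of the loss with respect to the logits is `diag(p) - p p^T`. The function names are hypothetical, and the model is a stand-in for a deep net, where step 1 would be the layer-by-layer forward iteration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ggn_vector_product(W, X, V):
    """Curvature-vector product without forming the curvature matrix.

    W, V: (classes, features) parameters and direction; X: (batch, features).
    """
    P = softmax(X @ W.T)                                      # (batch, classes)
    JV = X @ V.T                                              # step 1: J v (forward)
    HJV = P * JV - P * (P * JV).sum(axis=1, keepdims=True)    # step 2: H (J v)
    return HJV.T @ X / X.shape[0]                             # step 3: J^T (H J v) (backward)

def explicit_ggn(W, X):
    """Build the same matrix explicitly (only feasible for tiny models)."""
    C, F = W.shape
    P = softmax(X @ W.T)
    G = np.zeros((C * F, C * F))
    for x, p in zip(X, P):
        J = np.kron(np.eye(C), x[None, :])    # Jacobian of logits w.r.t. W.ravel()
        H = np.diag(p) - np.outer(p, p)       # Hessian of cross-entropy w.r.t. logits
        G += J.T @ H @ J
    return G / X.shape[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
X = rng.standard_normal((5, 4))
V = rng.standard_normal((3, 4))

fast = ggn_vector_product(W, X, V).ravel()
slow = explicit_ggn(W, X) @ V.ravel()
```

Both routes give the same vector, but the matrix-free version only needs memory proportional to the number of parameters, which is the point of the Hessian-free approach.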
Second, with an efficient calculation of , the natural gradient can be approximated by the conjugate gradient method. Algorithm 1 gives the pseudo-code of the preconditioned conjugate gradient (PCG) method, where is the preconditioned matrix, which is positive definite and is usually taken to be diagonal, and is the initial value. It should be highlighted that the choice of and strongly affects the convergence speed in practice.
The Hessian-free method for training a neural net usually needs about iterations of PCG per training iteration of the neural net. Therefore, this method has a much higher computation cost and so has no advantage in terms of wall-clock time over first-order gradient methods, for instance SGD, in particular when training deep neural networks.
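A standard preconditioned conjugate gradient loop with a diagonal preconditioner and an explicit initial value can be sketched as follows. This is the textbook formulation and is not necessarily identical to Algorithm 1; `matvec`, `precond_diag`, and the badly scaled test system are assumptions made for illustration.

```python
import numpy as np

def pcg(matvec, b, x0, precond_diag, max_iters, tol=1e-12):
    """Preconditioned conjugate gradient for a symmetric positive definite system.

    matvec(v) returns A @ v; precond_diag holds the diagonal of the preconditioner.
    """
    x = x0.copy()
    r = b - matvec(x)               # residual
    z = r / precond_diag            # preconditioned residual
    p = z.copy()                    # search direction
    rz = r @ z
    for _ in range(max_iters):
        if np.linalg.norm(r) < tol:
            break
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = r / precond_diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Badly scaled test system: a widely spread diagonal plus a small rank-one term.
n = 50
d = np.logspace(0, 3, n)
A = np.diag(d) + 0.01 * np.ones((n, n))
rng = np.random.default_rng(0)
b = rng.standard_normal(n)
x0 = np.zeros(n)

# Same iteration budget, with and without a good diagonal preconditioner.
x_pre = pcg(lambda v: A @ v, b, x0, precond_diag=d, max_iters=4)
x_plain = pcg(lambda v: A @ v, b, x0, precond_diag=np.ones(n), max_iters=4)
res_pre = np.linalg.norm(b - A @ x_pre)
res_plain = np.linalg.norm(b - A @ x_plain)
```

On this system a well-chosen diagonal preconditioner lets four iterations reach a tiny residual, while the unpreconditioned run is still far from converged, which illustrates why the choice of preconditioner matters so much for a small iteration budget.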
3 Meta-Learning with Hessian Free approach
To overcome the disadvantage of the Hessian-free method while retaining the advantage of the natural gradient, in this section we propose a novel method that applies meta-learning to the Hessian-free method. We use a variant of the damping technique of , which lets , where the vector parameter has all components nonnegative, i.e., for all , and is also referred to as the damping parameters. This variant has stronger representation capability than the original damping version, i.e., . We generate the damping parameters and the preconditioned matrix by two RNNs, and , respectively.
With the trained and , at each training step of the neural net , and infer the damping parameter and preconditioned matrix for the PCG algorithm (Algorithm 1). The PCG algorithm outputs the approximation of the natural gradient that gives the descent direction of . The pseudo-code of this approach is given in Algorithm 2.
The network structures of and follow the same coordinatewise framework as in . We consider six types of layer parameters: convolution kernels, convolution biases, fully-connected weights, fully-connected biases, batchnorm gamma and batchnorm beta. For different layer parameters of the same type, the RNNs share the same meta-parameters and conduct separate inferences for the different coordinates of the parameters of each layer; however, for different types of layer parameters, the RNNs have different meta-parameters. In addition, we highlight that the learning rate is fixed as , because the scale of the damping parameters is used to control the learning rate implicitly, and the initial value of PCG, , at the training iteration , takes the output value of PCG at the last iteration .
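The interplay of damping, preconditioning, and warm-started PCG in one training step can be sketched on a linear least-squares problem. This is a loose illustration and not Algorithm 2: the two RNNs are replaced by fixed hand-written stand-ins (`damping_stand_in`, `preconditioner_stand_in`), the curvature is the exact Gauss-Newton matrix of the toy problem, and the learning rate `lr=1.0` and the damping value `0.1` are arbitrary assumptions.

```python
import numpy as np

def pcg(matvec, b, x0, precond_diag, max_iters):
    """A few preconditioned conjugate gradient iterations for matvec(x) = b."""
    x = x0.copy()
    r = b - matvec(x)
    z = r / precond_diag
    p = z.copy()
    rz = r @ z
    for _ in range(max_iters):
        if rz < 1e-20:                       # already converged; avoid dividing by ~0
            break
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = r / precond_diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def damping_stand_in(n_params):
    """Stand-in for the damping RNN: a constant non-negative vector."""
    return np.full(n_params, 0.1)

def preconditioner_stand_in(curv_diag, damping):
    """Stand-in for the preconditioner RNN: the damped curvature diagonal."""
    return curv_diag + damping

def train(X, y, steps=20, pcg_iters=4, lr=1.0):
    n, n_params = X.shape
    w = np.zeros(n_params)
    d = np.zeros(n_params)                   # PCG initial value, reused across steps
    curv_diag = (X ** 2).mean(axis=0)        # diagonal of X^T X / n
    losses = []
    for _ in range(steps):
        residual = X @ w - y
        losses.append(0.5 * np.mean(residual ** 2))
        grad = X.T @ residual / n
        damping = damping_stand_in(n_params)
        precond = preconditioner_stand_in(curv_diag, damping)
        matvec = lambda v: X.T @ (X @ v) / n + damping * v   # damped curvature product
        d = pcg(matvec, grad, d, precond, pcg_iters)         # warm start from last output
        w = w - lr * d
    return w, losses

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(200)
w, losses = train(X, y)
```

In the method described above, the two stand-in functions would be replaced by the outputs of the trained RNNs, and the curvature product would come from the forward and backward passes of the target network rather than from an explicit data matrix.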
3.1 Training meta-parameters of the and
We utilize Back-Propagation-Through-Time (BPTT) to meta-train and in parallel, but with different loss functions. Let be the number of iterations of one sequential training process on the target network in meta-training; the loss function of is
where , , are defined in Algorithm 2. (Another natural choice is to minimize the square of the norm of in PCG, which means , but it seems not as good as using formula (2), considering that has quite a different scale and is hard to train stably in the initial phase of meta-training.) It can be seen that minimizing can improve the precision of the natural gradient estimate obtained from a few iterations of PCG. The loss function of is defined as
Here is inspired by  with some modifications, namely the addition of the second term in formula (4). The motivation for this term comes from a challenge of meta-training: the RNN has the tendency to predict the next input and to fit it, but the mini-batch is in fact unpredictable in meta-training, which might cause overfitting or make training hard at the early stage. Adding this term in (4) can reduce such influence and so stabilize the meta-training process. Thus, is the softmax weighted average over all .
Stop gradient propagation
Another advantage of stopping the back-propagation of the gradients of , , , is that it simplifies the gradient of the multiplication in PCG iterations. In detail, for (without the damping part), the gradient of is not propagated in back-propagation. For the gradient of , we obtain , that is, the gradient operator of is itself. With this technique, calculating second-order gradients in meta-training is no longer necessary, which also reduces GPU memory usage and simplifies the computation graph in practice.
Experience replay of
During meta-training, the inputs of one iteration in Algorithm 2 consist of , , , and . is sampled from the dataset; and take the value zero in practice. But for , the common choice of random generation is not suitable, especially for complex neural nets such as ResNet. Here, we use the experience replay technique [28, 32] to store and replay , as shown in Algorithm 3.
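A replay buffer of this kind could look like the sketch below. It assumes that the replayed quantity is a snapshot of the target network's parameters used as the starting point of a meta-training sequence, and that a fresh random initialization is still drawn occasionally. The class name, `capacity`, and `p_fresh` are hypothetical and are not taken from Algorithm 3.

```python
import random
from collections import deque

class ParameterReplayBuffer:
    """Stores parameter snapshots and replays them as starting points."""

    def __init__(self, capacity, fresh_init, p_fresh=0.1, seed=0):
        self._buffer = deque(maxlen=capacity)   # oldest snapshots are evicted first
        self._fresh_init = fresh_init           # callable: rng -> new parameter list
        self._p_fresh = p_fresh                 # chance of restarting from scratch
        self._rng = random.Random(seed)

    def store(self, params):
        """Save a copy of the parameters reached at the end of a sequence."""
        self._buffer.append(list(params))

    def sample(self):
        """Return a starting point: a stored snapshot or a fresh initialization."""
        if not self._buffer or self._rng.random() < self._p_fresh:
            return self._fresh_init(self._rng)
        return list(self._rng.choice(self._buffer))

    def __len__(self):
        return len(self._buffer)

def random_init(rng, n_params=3):
    """Hypothetical fresh initialization for a tiny parameter vector."""
    return [rng.gauss(0.0, 0.1) for _ in range(n_params)]

buffer = ParameterReplayBuffer(capacity=100, fresh_init=random_init)
start = buffer.sample()                         # buffer is empty: fresh initialization
end_of_sequence = [p - 0.01 for p in start]     # placeholder for a meta-training rollout
buffer.store(end_of_sequence)
next_start = buffer.sample()                    # usually replays the stored snapshot
```

Replaying stored snapshots exposes the meta-optimizer to parameters from later stages of training, which purely random starting points would not cover.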
3.2 Analysis of Computation Complexity
The major computation time of the MLHF method is spent in the forward and backward passes rather than in the inference of the RNNs. The forward passes include the difference forward pass and the first common forward pass; the difference forward pass is much faster than the first common forward pass, since intermediate results can be shared between forward passes at different times. So the time complexity is , where is the maximum number of iterations in PCG and is the time needed to finish one forward and backward pass. Here, we set , which usually takes times as long as SGD for each iteration, as illustrated in Section 4.
In the experiments, we implement the MLHF method of Algorithm 2 in TensorFlow. Here, and are set as two-layer LSTMs with as the preprocessing, and a linear map followed by a softplus as the post-processing, with units in each layer. In the meta-training process, the roll-back length of BPTT is set to . We use Adam as the optimizer for meta-training the RNNs, and the maximum number of PCG iterations is fixed to by default unless otherwise specified.
In the first and second experiments, we evaluate the MLHF on a simple model (CUDA-convnet) and a more complex one (ResNet18(v2)) in comparison with other optimizers, including the gradient-based first-order optimizers RMSprop, Adam and SGD + momentum (denoted SGD(m)), and the practical second-order optimizer kfac [25, 13]. For ResNet18(v2), we do not compare the MLHF with kfac, because running kfac on ResNet18(v2) exceeds the GPU memory limit. All optimizers' hyper-parameters are kept at their TensorFlow defaults unless otherwise specified. The results of these two experiments are illustrated by the loss function (the cross-entropy of the neural net) with respect to both the number of trained samples and the wall time. All experiments were done on a single Nvidia GTX Titan Xp, and the code is available at https://www.github.com/ozzzp/MLHF
In this work, we do not include other meta-optimizers, e.g. L2L, in the comparison, because we can show that our method is superior even on a simple MLP, and we failed to train L2L to decrease the loss function efficiently on CUDA-convnet. See the supplementary materials for details.
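The role of the softplus post-processing can be illustrated as follows: applying a shared linear map and then a softplus to each coordinate's hidden state yields non-negative outputs, which matches the non-negativity required of the damping parameters and the positive definiteness of a diagonal preconditioner. The sketch assumes the linear map comes before the softplus; the shapes, names, and random hidden states are illustrative stand-ins for the LSTM outputs.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def post_process(hidden, weight, bias):
    """Map per-coordinate hidden states (n_coords, n_hidden) to one value each.

    The same weight and bias are used for every coordinate, as in a
    coordinatewise framework.
    """
    return softplus(hidden @ weight + bias)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((6, 8))   # 6 parameter coordinates, 8 hidden units
weight = rng.standard_normal(8)        # shared across all coordinates
bias = 0.0
outputs = post_process(hidden, weight, bias)

# The stable form avoids overflow for large inputs.
extremes = softplus(np.array([-1000.0, 0.0, 1000.0]))
```

Because the softplus never returns a negative value, no explicit clipping or projection step is needed to keep the generated quantities in their valid range.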
4.1 Experiment 1: Convnet on Cifar10
CUDA-Convnet is a simple CNN with convolution layers and fully connected layers. Here, we use a variant of CUDA-Convnet, which drops the LRN layer and uses a fully connected layer instead of the locally connected layer on the top of the model. We meta-train an MLHF optimizer with batch size equal to by BPTT on cifar10 for epochs. After meta-training, we validate this meta-trained optimizer as well as the compared optimizers by training the same model on the same dataset with batch size of . Even though this model is quite simple, it has parameters, which is indeed more than the previous models involved in the learning-to-learn literature.
Figure 1 (a) and (b) show that the MLHF optimizer performs much better than kfac, RMSprop, Adam, and SGD(m) in terms of both sample number and wall time.
4.2 Experiment 2: ResNet on ILSVRC2012
To validate the generalization of MLHF across different datasets and different-but-similar neural network architectures, we meta-train on a mini version of the ResNet model, which has res-blocks with channels , on cifar10 for epochs. Then we employ the meta-trained MLHF to train a ResNet18(v2) on the ILSVRC2012 dataset. The batch size is in meta-training and in training on ILSVRC2012, due to the limitation of GPU memory.
As shown in Figure 2 (a), the MLHF performs best among all evaluated optimizers when training ResNet18(v2) on ILSVRC2012, in both the rapid-descent early stage and the steady-descent later stage, measured by the number of training samples. However, Figure 2 (b) indicates that SGD(m) performs as well as the MLHF at the early stage and iterates faster than the MLHF in wall time. It can also be seen that the MLHF makes effective progress in decreasing the loss function during the whole long training process, which overcomes the major shortcoming of the previous meta-learning methods.
4.3 Experiment 3: Ablation experiment
In this experiment, to verify the efficiency of towards the natural gradient, we employ the same meta-training configuration as in Section 4.2 and re-meta-train the MLHF under the following four configurations for comparison: (1). remove and set the maximum iteration number of PCG to ; (2). remove and set the maximum iteration number of PCG to ; (3). keep but set the maximum iteration number of PCG to 2; (4). keep all settings at the default. We highlight that config (1) can be regarded as the best performance of PCG, at a large cost in computation time.
From Figure 3 (a) and (b), one can make the following observations. First, with the help of , very few (4) iterations of PCG (config 4) can estimate the natural gradient as precisely as sufficient iterations of PCG (config 1), as measured by (Figure 3 (a)); however, 4 iterations are far from convergence of PCG, whereas iterations (config 1) guarantee good convergence of PCG, as measured by the mean of (Figure 3 (b)). Second, in contrast, without , a few iterations of PCG (config 2) result in a bad estimate of the natural gradient and, of course, are far from convergence of PCG. Finally, we highlight that 4 iterations could be the optimal number for PCG with the help of , because further reducing the number of iterations, i.e., to 2 iterations of PCG (config 3), results in both a bad approximation of the natural gradient and bad convergence of PCG.
5 Conclusions and Discussions
In conclusion, we introduced a novel second-order meta-optimizer based on the Hessian-free approach. We utilized the PCG algorithm to approximate the natural gradient as the optimal descent direction for neural net training. Using the coordinatewise framework, we designed and to infer the damping parameters and the preconditioned matrix, such that very few iterations of the PCG algorithm achieve a good approximation of the natural gradient at an acceptably low computation cost. Furthermore, a few techniques were used to meta-train the MLHF efficiently. Experiments showed that this meta-optimizer makes progress efficiently during both the early and later stages of a long training process on large-scale neural nets with big datasets, including CUDA-convnet on cifar10 and ResNet18(v2) on ILSVRC2012.
The advantage of the MLHF can be explained in two ways. First, one can observe that, however is trained, as long as works well in the sense that approximates well, we have , which implies that under any over-fitting scenario of , the loss of decreases with a sufficiently small learning rate. Therefore, the training process can progress efficiently even in the gradual stage. Second, it can be seen that each coordinate of is determined by the whole and , which may result in good error tolerance.
To sum up, this advantage implies that the presented meta-optimizer can be a promising meta-learning framework towards elevating the training efficiency in practical deep neural nets.
The limitation of this work still lies in the wall-time cost in comparison to first-order gradient methods. As the number of neural net parameters increases, the wall-time cost of the meta-optimizer increases proportionally, which will weaken its advantage in training efficiency on a very large-scale neural net, given fixed computation resources.
In future work, we wish to evaluate the generalization of MLHF on a wider range of neural networks, including RNN, RCNN, etc., and to develop a distributed version of MLHF. (Unfortunately, we did not complete the experiment on ResNet50 because, without a distributed version, the maximum batch size on a single Nvidia GTX Titan Xp is only 8, which is too small to train on ILSVRC2012.) Simplifying and accelerating MLHF is another direction. We have great expectations that this direction can make the learning-to-learn approach exhibit its promised efficacy in deep neural networks.
- [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- [2] S.-I. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
- [3] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- [4] K. E. Atkinson. An introduction to numerical analysis. John Wiley & Sons, 2008.
- [5] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.
- [6] S. Bengio, Y. Bengio, and J. Cloutier. On the search for new learning rules for anns. Neural Processing Letters, 2(4):26–30, 1995.
- [7] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.
- [8] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. arXiv preprint arXiv:1611.03824, 2016.
- [9] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. Ilsvrc-2012, 2012. URL http://www.image-net.org/challenges/LSVRC, 2012.
- [10] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
- [11] R. Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
- [12] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- [13] R. Grosse and J. Martens. A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.
- [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [15] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems, volume 49. NBS Washington, DC, 1952.
- [16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [18] A. Krizhevsky. cuda-convnet: High-performance c++/cuda implementation of convolutional neural networks. Source code available at https://github.com/akrizhevsky/cuda-convnet2 [March, 2017], 2012.
- [19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- [20] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 1998.
- [21] K. Li and J. Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
- [22] K. Li and J. Malik. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.
- [23] J. Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
- [24] J. Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
- [25] J. Martens and R. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
- [26] J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040. Citeseer, 2011.
- [27] J. Martens and I. Sutskever. Training deep and recurrent networks with hessian-free optimization. In Neural networks: Tricks of the trade, pages 479–535. Springer, 2012.
- [28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- [29] D. K. Naik and R. Mammone. Meta-neural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, volume 1, pages 437–442. IEEE, 1992.
- [30] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
- [31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
- [32] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- [33] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
- [34] R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.
- [35] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
- [36] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- [37] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
- [38] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein. Learned optimizers that scale and generalize. arXiv preprint arXiv:1703.04813, 2017.