Projection Based Weight Normalization for Deep Neural Networks
Abstract
Optimizing deep neural networks (DNNs) often suffers from ill-conditioning. We observe that the scaling-based weight space symmetry of rectified nonlinear networks causes this negative effect. Therefore, we propose to constrain the incoming weights of each neuron to be unit-norm, which is formulated as an optimization problem over the Oblique manifold. A simple yet efficient method, referred to as projection based weight normalization (PBWN), is also developed to solve this problem. PBWN executes standard gradient updates, followed by projecting the updated weight back to the Oblique manifold. The proposed method has a regularization effect and collaborates well with the commonly used batch normalization technique. We conduct comprehensive experiments on several widely-used image datasets, including CIFAR-10, CIFAR-100, SVHN and ImageNet, for supervised learning over state-of-the-art convolutional neural networks such as Inception, VGG and residual networks. The results show that our method consistently improves the performance of DNNs with different architectures. We also apply our method to the Ladder network for semi-supervised learning on the permutation invariant MNIST dataset, where it outperforms the state-of-the-art methods: we obtain test errors of 2.52%, 1.06%, and 0.91% with only 20, 50, and 100 labeled samples, respectively.
1 Introduction
Deep neural networks have achieved great success across a broad range of domains, such as computer vision, speech processing and natural language processing [16, 46, 44]. While their deep and complex structure provides powerful representation capacity and appealing advantages in learning feature hierarchies, it also makes learning difficult. In the literature, various heuristics and optimization algorithms have been studied in order to improve the efficiency of training, including weight initialization [24, 10, 15], normalization of internal activations [20], and sophisticated optimization methods [13, 48]. Despite this progress, training deep neural networks and ensuring satisfactory performance is still largely an open problem, due to the non-convex nature of the objective and ill-conditioning.
Deep neural networks (DNNs) have a large number of local minima, because they usually suffer from the model identifiability problem. A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters [11]. Neural networks are often not identifiable because we can obtain equivalent models by swapping their weights with each other, which is called weight space symmetry [6]. In addition, for the commonly used rectified nonlinear [29] or maxout networks [12], we can also construct equivalent models by scaling the incoming weight of a neuron by a factor of $\alpha$ while scaling its outgoing weight by $1/\alpha$. We refer to this as scaling-based weight space symmetry [31]. These issues imply that there can be an extremely large or even uncountably infinite number of local minima for a neural network. Although it remains an open question whether the difficulty of optimizing neural networks originates from local minima, we observe that the scaling-based weight space symmetry can make the Hessian matrix ill-conditioned, which is deemed to be the most prominent challenge in optimization [10, 38].
To alleviate the negative effect of scaling-based weight space symmetry, we propose to constrain the incoming weights of each neuron to be unit-norm. This simple strategy ensures that the weight matrices of all layers have almost the same magnitude. Besides, it preserves the norm of the back-propagated information during linear transformations. Training neural networks with such constraints can be formulated as an optimization problem over the Oblique manifold [1]. To address this optimization problem, we propose a projection based weight normalization method to improve both performance and efficiency. Our method executes standard gradient updates, followed by projecting the updated weight back to the Oblique manifold. We point out that the proposed method has a regularization effect similar to weight decay [23], and can be viewed as a regularization term with adaptive regularization factors. We further show that, when batch normalization [20] is employed in the network, our method implicitly adjusts the learning rate while ensuring that the incoming weight of each neuron is unit-norm.
We conduct comprehensive experiments on several widely-used image datasets, including CIFAR-10, CIFAR-100 [22], SVHN [30] and ImageNet [8], for supervised learning over state-of-the-art Convolutional Neural Networks (CNNs) such as Inception [43], VGG [39] and residual networks [16, 49]. The experimental results show that our method improves the performance of deep neural networks with different architectures without revising any experimental setups. We also consider semi-supervised learning on the permutation invariant MNIST dataset by applying our method to the Ladder network [35]. Our method outperforms the state-of-the-art results in this task: we achieve test errors of 2.52%, 1.06%, and 0.91% with only 20, 50, and 100 labeled training samples, respectively. Code to reproduce our experimental results is available at https://github.com/huangleiBuaa/NormProjection. Our contributions are as follows.

We propose to optimize neural networks over the Oblique manifold, which can alleviate the ill-conditioning caused by scaling-based weight space symmetry.

We propose the projection based weight normalization method (PBWN), which serves as a simple yet effective and efficient solution to optimization over the Oblique manifold in DNNs. We further show that PBWN has a regularization effect similar to weight decay, and also collaborates well with the commonly used batch normalization technique.

We apply PBWN to state-of-the-art CNNs over large-scale datasets, and improve the performance of networks with different architectures without revising any experimental setups. Besides, the additional computation cost introduced by PBWN is negligible.
2 Optimization over Oblique Manifold in DNNs
Consider a learning problem with training data $D = \{(x_i, y_i)\}_{i=1}^{N}$ using a feed-forward neural network with $L$ layers, where $x_i$ refers to the input and $y_i$ the corresponding target. The network is parameterized by a set of weights $\{W_l\}_{l=1}^{L}$ and biases $\{b_l\}_{l=1}^{L}$, in which each layer is composed of a linear transformation and an element-wise nonlinearity: $h_l = \varphi(W_l h_{l-1} + b_l)$. In this paper, we mainly focus on the rectifier activation function, which has the property $\varphi(\alpha x) = \alpha \varphi(x)$ for $\alpha \geq 0$, and drop the biases to simplify the discussion.
Given a loss function $\ell(y, f(x; W))$ that measures the mismatch between the desired output $y$ and the predicted output $f(x; W)$, we can train a neural network by minimizing the empirical loss as follows:
$$\min_{\{W_l\}} \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i; W)) \tag{1}$$
In the above formulation, the gradient information determines how the network parameters are tuned. The weight updating rule of each layer for one iteration is usually based on Stochastic Gradient Descent (SGD):
$$W_l \leftarrow W_l - \eta \frac{\partial \mathcal{L}}{\partial W_l} \tag{2}$$
where $\eta$ is the learning rate and the gradient of the loss function with respect to the parameters is approximated over a mini-batch of size $m$ by computing $\frac{\partial \mathcal{L}}{\partial W_l} \approx \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(y_i, f(x_i; W))}{\partial W_l}$.
2.1 Scaling-Based Weight Space Symmetry
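In this part, we show why the scaling-based weight space symmetry can make the Hessian matrix ill-conditioned, and why this behaviour makes training deep neural networks more challenging.
We consider a very simple two-layer linear model with only one neuron per layer, and omit the rectified nonlinear layer to simplify the discussion without loss of generality. Let $h = w_1 x$ and $y = w_2 h$ for the two layers, and define the loss function $\mathcal{L}$. We further assume $w_1$ and $w_2$ have the same magnitude. Based on the scaling-based weight space symmetry, we consider another two-layer linear model parameterized by $\hat{w}_1 = \alpha w_1$ and $\hat{w}_2 = \frac{1}{\alpha} w_2$, where $\alpha > 0$. Under this parameterization, we still have the same model output $\hat{y} = y$ for the same input $x$.
For these two models, we can get the back-propagated gradient information $\frac{\partial \mathcal{L}}{\partial y}$ and $\frac{\partial \mathcal{L}}{\partial \hat{y}}$, and further have $\frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\partial \mathcal{L}}{\partial y}$ due to the fact that $\hat{y} = y$. By simple algebra, it is easy to obtain $\frac{\partial \mathcal{L}}{\partial \hat{w}_1} = \frac{1}{\alpha} \frac{\partial \mathcal{L}}{\partial w_1}$ and $\frac{\partial \mathcal{L}}{\partial \hat{w}_2} = \alpha \frac{\partial \mathcal{L}}{\partial w_2}$. This implies that if $\hat{w}_1$ and $\hat{w}_2$ differ in magnitude, their gradients $\frac{\partial \mathcal{L}}{\partial \hat{w}_1}$ and $\frac{\partial \mathcal{L}}{\partial \hat{w}_2}$ differ inversely in magnitude. Consequently, as $\alpha$ becomes larger, it is more likely that the Hessian matrix will be ill-conditioned, as shown in Figure 1.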
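The gradient relation above can be checked numerically. The following is a minimal sketch, assuming a squared loss $0.5 (y - t)^2$ (the concrete loss, the target `t`, and all numeric values are illustrative assumptions, not taken from the paper):

```python
def forward(w1, w2, x):
    """Two-layer linear model with one neuron per layer: y = w2 * (w1 * x)."""
    return w2 * (w1 * x)


def grads(w1, w2, x, t):
    """Gradients of the assumed loss 0.5 * (y - t)**2 w.r.t. w1 and w2."""
    dy = forward(w1, w2, x) - t      # dL/dy
    return dy * w2 * x, dy * w1 * x  # dL/dw1, dL/dw2


# Illustrative values (hypothetical).
w1, w2, x, t = 0.5, 2.0, 3.0, 1.0
alpha = 10.0

# Rescaled model: incoming weight times alpha, outgoing weight divided by alpha.
w1_hat, w2_hat = alpha * w1, w2 / alpha

g1, g2 = grads(w1, w2, x, t)
g1_hat, g2_hat = grads(w1_hat, w2_hat, x, t)

print(forward(w1, w2, x), forward(w1_hat, w2_hat, x))  # same output
print(g1_hat / g1, g2_hat / g2)                        # about 1/alpha and alpha
```

The two models produce the same output, while the two gradients are rescaled in opposite directions, which is the imbalance the text attributes to the scaling-based symmetry.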
2.2 Formulation for Unit-Norm Constraint
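To relieve the negative effect of scaling-based weight space symmetry, in this paper we propose to constrain the incoming weights of each neuron to be unit-norm. (We can also constrain the outgoing weights to be unit-norm. However, it seems more intuitive with filters being unit-norm.) Specifically, we reformulate the optimization problem of Eqn. 1 as follows:
$$\min_{\{W_l\}} \mathcal{L} \quad \text{s.t.} \quad \mathrm{ddiag}(W_l W_l^T) = I, \quad l = 1, \dots, L \tag{3}$$
where $\mathrm{ddiag}(\cdot)$ denotes an operation that extracts the diagonal elements of a matrix and sets the off-diagonal elements to 0. We drop the index $l$ of $W_l$ to simplify the notation. Indeed, the constraint on the weight matrix $W \in \mathbb{R}^{n \times d}$ in each layer defines an embedded submanifold of $\mathbb{R}^{n \times d}$ called the Oblique manifold [1]:
$$\mathcal{O}(n, d) = \{ W \in \mathbb{R}^{n \times d} : \mathrm{ddiag}(W W^T) = I \} \tag{4}$$
Note that here we adopt $\mathcal{O}(n, d)$ to denote the set of all $n \times d$ matrices with normalized rows, which is different from the standard notation with normalized columns [1, 3].
First, we can apply a Riemannian optimization method [3] to solve the problem in Eqn. 3. We calculate the Riemannian gradient $\widehat{\frac{\partial \mathcal{L}}{\partial W}}$ in the tangent space of $\mathcal{O}(n, d)$ at the current point $W$ by:
$$\widehat{\frac{\partial \mathcal{L}}{\partial W}} = \frac{\partial \mathcal{L}}{\partial W} - \mathrm{ddiag}\left(\frac{\partial \mathcal{L}}{\partial W} W^T\right) W \tag{5}$$
where $\frac{\partial \mathcal{L}}{\partial W}$ is the ordinary gradient.
Given the Riemannian gradient, we update the weight along the negative Riemannian gradient with $-\eta \widehat{\frac{\partial \mathcal{L}}{\partial W}}$ in the tangent space, where $\eta$ is the learning rate. We then use a retraction, as suggested by [1], that maps the tangent vectors to points on the manifold as:
$$\Upsilon_W(\xi) = \mathrm{ddiag}\left((W + \xi)(W + \xi)^T\right)^{-1/2} (W + \xi) \tag{6}$$
where $\xi = -\eta \widehat{\frac{\partial \mathcal{L}}{\partial W}}$. Therefore, we can get the new point on the Oblique manifold as $W \leftarrow \Upsilon_W\left(-\eta \widehat{\frac{\partial \mathcal{L}}{\partial W}}\right)$. We update the weight matrices iteratively until convergence.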
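One iteration of the Riemannian update described above can be sketched as follows. This is an illustrative pure-Python version, assuming each row of `W` holds the incoming weights of one neuron; the function names and the numeric values are hypothetical and not from the authors' Torch code:

```python
import math


def riemannian_grad(W, G):
    """Project the ordinary gradient G onto the tangent space at W.

    For each unit-norm row w with gradient g this computes g - (w . g) w.
    """
    out = []
    for w, g in zip(W, G):
        dot = sum(wi * gi for wi, gi in zip(w, g))
        out.append([gi - dot * wi for wi, gi in zip(w, g)])
    return out


def retract(W, xi):
    """Map W + xi back onto the manifold by rescaling each row to unit norm."""
    out = []
    for w, v in zip(W, xi):
        row = [wi + vi for wi, vi in zip(w, v)]
        norm = math.sqrt(sum(r * r for r in row))
        out.append([r / norm for r in row])
    return out


def riemannian_step(W, G, lr):
    """Move along the negative Riemannian gradient, then retract."""
    R = riemannian_grad(W, G)
    xi = [[-lr * r for r in row] for row in R]
    return retract(W, xi)


# Two neurons with unit-norm incoming weights and an arbitrary gradient.
W = [[1.0, 0.0], [0.6, 0.8]]
G = [[0.3, -0.2], [1.0, 0.5]]
R = riemannian_grad(W, G)
W_new = riemannian_step(W, G, lr=0.1)
```

Each row of `R` is orthogonal to the corresponding row of `W`, and each row of `W_new` has unit norm again.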
3 Projection Based Normalization
The Riemannian optimization method provides a good solution to the problem in Eqn. 3. However, it also introduces non-negligible extra computational cost. For instance, in each iteration we have to calculate the Riemannian gradient by subtracting the extra term $\mathrm{ddiag}\left(\frac{\partial \mathcal{L}}{\partial W} W^T\right) W$ and then map the updated weight from the tangent space back to the Oblique manifold by multiplying with the inverse square root of a diagonal matrix. Is it possible to reduce the computational cost without performance loss, while still guaranteeing that the solution satisfies the unit-norm constraints?
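To make the following analysis clearer, let us first consider one neuron with its incoming weight $w \in \mathbb{R}^{d}$ satisfying the unit-norm constraint $w^T w = 1$. Based on Eqn. 5, its Riemannian gradient can be obtained as follows:
$$\widehat{\frac{\partial \mathcal{L}}{\partial w}} = \frac{\partial \mathcal{L}}{\partial w} - \left(w^T \frac{\partial \mathcal{L}}{\partial w}\right) w \tag{7}$$
From Eqn. 7, we can see that the Riemannian gradient adjusts the ordinary gradient by subtracting the extra term $\left(w^T \frac{\partial \mathcal{L}}{\partial w}\right) w$. Besides, we have the following fact
$$\left\| \left(w^T \frac{\partial \mathcal{L}}{\partial w}\right) w \right\|_2 = \left| w^T \frac{\partial \mathcal{L}}{\partial w} \right| \leq \|w\|_2 \left\| \frac{\partial \mathcal{L}}{\partial w} \right\|_2 = \left\| \frac{\partial \mathcal{L}}{\partial w} \right\|_2 \tag{8}$$
which means that $\left(w^T \frac{\partial \mathcal{L}}{\partial w}\right) w$ is not a dominant term compared to $\frac{\partial \mathcal{L}}{\partial w}$ in Eqn. 7. We also observe this fact in our experiments. Therefore, we recommend simply using the ordinary gradient to solve the problem in Eqn. 3 with much less computation cost, as follows:
$$w' = w - \eta \frac{\partial \mathcal{L}}{\partial w} \tag{9}$$
$$w \leftarrow \frac{w'}{\|w'\|_2} \tag{10}$$
Here, Eqn. 10 works by projecting the updated weight $w'$ back to the Oblique manifold, and we thus call this operation norm projection. Indeed, the operation combining Eqn. 9 and 10 is equivalent to the retraction in Eqn. 6 when the Riemannian gradient $\widehat{\frac{\partial \mathcal{L}}{\partial w}}$ is used in place of the ordinary gradient.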
Note that if the weight update is based on the ordinary gradient as in Eqn. 9, the norm projection operation cannot make the update go exactly along the negative gradient direction, and therefore disturbs the gradient information. We find that this disturbance does not harm learning, as shown in Figure 2 (a): using the ordinary gradient yields a training loss curve nearly identical to that of the Riemannian gradient.
For more efficient computation, we can also execute the norm projection operation of Eqn. 10 at an interval of $T$ iterations rather than in each iteration. We empirically find that this trick works well in practice. It should be pointed out that when executing the norm projection with a large $T$, our method may lose some information learned in the weight matrix and also suffer from instability after the norm projection, as shown in Figure 2 (b). There we can see that, in the initial phase, executing the norm projection at a large interval results in a sudden increase of the loss. This is mainly because we change the scale of each filter, which changes the predictions for the same input. Fortunately, we can remedy this issue by combining our method with batch normalization [20], as discussed in the next subsection.
To summarize, we show our projection based weight normalization framework in Algorithm 1, in which an extra norm projection is executed at an interval of $T$ iterations. Note that the Riemannian optimization over the Oblique manifold described before can be viewed as a specific instance of our framework, under the conditions that we use the Riemannian gradient, steepest gradient descent and interval $T = 1$.
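The framework summarized above can be sketched in a few lines. This is not the authors' Algorithm 1 verbatim but a minimal illustration under stated assumptions: plain gradient descent as the optimizer, a single weight matrix whose rows are the incoming weights of each neuron, an initial projection onto the manifold, and a hypothetical toy objective (`toy_grad`, `x`, `targets` are made up for the example):

```python
import math


def norm_project(W):
    """Rescale every row (incoming weights of one neuron) to unit L2 norm."""
    projected = []
    for w in W:
        norm = math.sqrt(sum(v * v for v in w))
        projected.append([v / norm for v in w])
    return projected


def pbwn_train(W, grad_fn, lr, num_iters, T=1):
    """Ordinary gradient steps with a norm projection every T iterations."""
    W = norm_project(W)  # assumption: start from a point on the manifold
    for it in range(1, num_iters + 1):
        G = grad_fn(W)
        W = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, G)]
        if it % T == 0:
            W = norm_project(W)
    return W


# Hypothetical toy objective: 0.5 * sum_i (w_i . x - t_i)^2 for one input x.
x = [1.0, 2.0]
targets = [0.5, -1.0]


def toy_grad(W):
    grads = []
    for w, t in zip(W, targets):
        err = sum(wi * xi for wi, xi in zip(w, x)) - t
        grads.append([err * xi for xi in x])
    return grads


W0 = [[0.3, -0.4], [2.0, 1.0]]
W_every_iter = pbwn_train(W0, toy_grad, lr=0.05, num_iters=20, T=1)
W_interval = pbwn_train(W0, toy_grad, lr=0.05, num_iters=20, T=5)
row_norms = [math.sqrt(sum(v * v for v in w)) for w in W_interval]
```

With `T=1` the weights return to the manifold after every step; with a larger `T` they drift off it between projections, which is the trade-off discussed in the text.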
3.1 Combined with Batch Normalization
Batch normalization is a popular technique that stabilizes the distribution of activations in each layer and thus accelerates convergence. It works by normalizing the pre-activation of each neuron to zero mean and unit variance over each mini-batch, and extra learnable scale and bias parameters are recommended to restore the representation power of the network. Specifically, for each neuron with pre-activation $z = w^T x$, batch normalization has the following formulation:
$$\mathrm{BN}(z) = \gamma \frac{z - \mathrm{E}[z]}{\sqrt{\mathrm{Var}[z]}} + \beta \tag{11}$$
where $\gamma$ and $\beta$ are the learnable scale and bias. One interesting property of batch normalization is that it is invariant to the scaling of the incoming weight of each neuron, that is, for any scalar $\alpha > 0$,
$$\mathrm{BN}(\alpha w^T x) = \mathrm{BN}(w^T x) \tag{12}$$
The norm projection operation of Eqn. 10 can be viewed as a scaling of $w'$. Therefore, when combined with batch normalization, the norm projection keeps the output unchanged during training in a rectifier network, that is, $\mathrm{BN}(w^T x) = \mathrm{BN}(w'^T x)$ with $w = w' / \|w'\|_2$. Consequently, norm projection does not drop any learned information in the weight matrix, even though we execute it outside the gradient descent steps.
Another interesting point is that norm projection affects the back-propagated information when combined with batch normalization. Batch normalization has the property
$$\frac{\partial \mathrm{BN}(\alpha w^T x)}{\partial (\alpha w)} = \frac{1}{\alpha} \frac{\partial \mathrm{BN}(w^T x)}{\partial w} \tag{13}$$
Therefore, taking $\alpha = 1 / \|w'\|_2$, we get
$$\frac{\partial \mathcal{L}}{\partial w} = \|w'\|_2 \frac{\partial \mathcal{L}}{\partial w'} \tag{14}$$
This indicates that the norm projection operation implicitly adjusts the learning rate by a factor of $\|w'\|_2$.
To summarize, when combined with batch normalization in a rectifier network, the norm projection operation has the following characteristics: (1) it guarantees that the incoming weight is unit-norm; (2) it keeps the output the same as before the operation during training; (3) it implicitly adjusts the learning rate by a factor of $\|w'\|_2$. These characteristics give our projection based weight normalization a stable optimization process.
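The two batch normalization properties used above (output invariance and gradient rescaling under norm projection) can be checked numerically. The sketch below is illustrative only: it uses a toy loss on a single neuron, omits the small constant that practical batch normalization adds to the variance, and estimates gradients by finite differences; all data values and helper names are assumptions:

```python
import math


def batch_norm(z, gamma=1.0, beta=0.0):
    """Normalize a mini-batch of scalar pre-activations (no epsilon term)."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    return [gamma * (v - mean) / math.sqrt(var) + beta for v in z]


def preact(w, X):
    """Pre-activations w^T x for every sample x in the mini-batch X."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in X]


def loss(w, X, c):
    """Toy loss: weighted sum of ReLU(BN(w^T x)) over the mini-batch."""
    out = batch_norm(preact(w, X))
    return sum(ci * max(o, 0.0) for ci, o in zip(c, out))


def num_grad(w, X, c, eps=1e-6):
    """Central finite-difference estimate of d loss / d w."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((loss(wp, X, c) - loss(wm, X, c)) / (2 * eps))
    return g


X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
c = [1.0, -2.0, 0.5, 1.5]

w_prime = [3.0, 4.0]                            # weight before projection
scale = math.sqrt(sum(v * v for v in w_prime))  # its L2 norm
w = [v / scale for v in w_prime]                # weight after norm projection

bn_before = batch_norm(preact(w_prime, X))
bn_after = batch_norm(preact(w, X))
g_prime = num_grad(w_prime, X, c)
g = num_grad(w, X, c)
```

Here `bn_before` and `bn_after` agree, and `g` matches `scale * g_prime` up to finite-difference error, consistent with the learning-rate interpretation in the text.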
3.2 Connecting to Weight Decay
We find that our projection based weight normalization has strong connections to weight decay [23], a simple yet effective technique to regularize neural networks. The update formulation of weight decay is:
$$w \leftarrow w - \eta \left( \frac{\partial \mathcal{L}}{\partial w} + \lambda w \right) \tag{15}$$
where $\lambda$ is a constant weight decay factor. Indeed, weight decay can be considered as a solution to the loss function appended with a regularization term $\frac{\lambda}{2} \|w\|_2^2$. From this perspective, we can treat weight decay as a soft constraint, while our method imposes a hard constraint $\|w\|_2 = 1$ on each neuron's incoming weight.
From another perspective, we can get the weight updating formulation of our method based on Eqn. 9 and 10:
$$w \leftarrow \frac{1}{\|w'\|_2} w - \frac{\eta}{\|w'\|_2} \frac{\partial \mathcal{L}}{\partial w} \tag{16}$$
where $w' = w - \eta \frac{\partial \mathcal{L}}{\partial w}$. We can see that Eqn. 16 has a similar weight updating form to weight decay. In particular, we have a weight-specific decay rate and also a weight-specific learning rate. Therefore, the solution to optimization over the Oblique manifold can be viewed as a regularization method with adaptive regularization factors. Finally, the weight matrix $W \in \mathbb{R}^{n \times d}$ in $\mathcal{O}(n, d)$ only has $n(d-1)$ degrees of freedom.
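The comparison with weight decay can be made concrete with a small sketch. It assumes the single-neuron form of the update (a gradient step followed by division by the norm of the updated weight); the helper names and numbers are hypothetical:

```python
import math


def weight_decay_step(w, g, lr, lam):
    """Weight decay update: w <- w - lr * (g + lam * w)."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, g)]


def pbwn_step(w, g, lr):
    """Gradient step followed by norm projection; also returns ||w'||."""
    w_prime = [wi - lr * gi for wi, gi in zip(w, g)]
    norm = math.sqrt(sum(v * v for v in w_prime))
    return [v / norm for v in w_prime], norm


def pbwn_step_decay_form(w, g, lr):
    """The same update written as (1/||w'||) * w - (lr/||w'||) * g."""
    _, norm = pbwn_step(w, g, lr)
    return [wi / norm - (lr / norm) * gi for wi, gi in zip(w, g)]


w = [0.6, 0.8]    # unit-norm incoming weight
g = [0.5, -1.0]   # illustrative gradient
projected, norm = pbwn_step(w, g, lr=0.1)
decay_form = pbwn_step_decay_form(w, g, lr=0.1)
decayed = weight_decay_step(w, g, lr=0.1, lam=0.0005)

# Coefficient on w: (1 - lr * lam) is fixed for weight decay,
# whereas 1 / norm depends on the current weight and gradient.
implied_shrink = 1.0 / norm
```

The two PBWN forms coincide, and the shrink factor `1 / norm` changes from step to step, which is one way to read the "adaptive regularization factor" mentioned above.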
3.3 Computational Cost
Let us consider a standard linear layer $y = Wx$ with $W \in \mathbb{R}^{n \times d}$ and mini-batch input data of size $m$. For each iteration, the computational cost of the standard linear layer (the forward product and the back-propagated gradients) is $O(mnd)$ FLOPs. The extra cost of Riemannian optimization is $O(nd)$ FLOPs. When using our norm projection with the ordinary gradient, the extra cost is also $O(nd)$ FLOPs, with a smaller constant. In particular, if we use an interval $T$, the extra cost is $O(nd / T)$ FLOPs per iteration. We can see that the computational cost of norm projection with interval updates is negligible compared to that of the standard linear layer.
For a convolution layer with filters $W \in \mathbb{R}^{n \times d \times F_h \times F_w}$, where $F_h$ and $F_w$ respectively indicate the height and width of the filter, we perform norm projection over the unrolled $W \in \mathbb{R}^{n \times (d F_h F_w)}$. Assuming an input feature map of size $h \times w$, the cost of the convolution layer is $O(m h w n d F_h F_w)$ FLOPs. Norm projection with interval updating has an extra cost of $O(n d F_h F_w / T)$ FLOPs, which is also negligible compared to the convolution operation.
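To give a feel for the orders of magnitude, the following back-of-the-envelope sketch compares the two costs for a linear layer. The counting convention is an assumption made here for illustration (2 FLOPs per multiply-add, three matrix products per iteration for the layer, and roughly 3 FLOPs per weight entry for one projection); the constants are not the paper's figures:

```python
def linear_layer_flops(m, n, d):
    """Forward product plus two backward products, 2 FLOPs per multiply-add."""
    return 3 * 2 * m * n * d


def norm_projection_flops(n, d, T=1):
    """Amortized per-iteration cost of row-wise normalization every T steps."""
    per_projection = 3 * n * d  # squares and sums (~2nd) plus divisions (~nd)
    return per_projection / T


# Illustrative layer and batch sizes (hypothetical).
m, n, d = 128, 512, 512
ratio_every_iter = norm_projection_flops(n, d) / linear_layer_flops(m, n, d)
ratio_interval = norm_projection_flops(n, d, T=100) / linear_layer_flops(m, n, d)
print(ratio_every_iter, ratio_interval)
```

Under this convention the relative overhead is $1/(2mT)$, so it shrinks with both the mini-batch size and the projection interval.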
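Table 1: Test errors (%) on CIFAR-10 and CIFAR-100 with the Inception architecture.

| Methods | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Normal | 6.48 ± 0.14 | 25.71 ± 0.15 |
| WN | 6.20 ± 0.07 | 24.22 ± 0.53 |
| PBWN-Riem (ours) | 5.33 ± 0.19 | 22.46 ± 0.25 |
| PBWN (ours) | 5.22 ± 0.05 | 22.70 ± 0.65 |
| PBWN-Epoch (ours) | 5.46 ± 0.22 | 22.83 ± 0.87 |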
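Table 2: Test errors (%) on CIFAR-10 and CIFAR-100 with the VGG architecture.

| Methods | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Normal | 7.23 ± 0.29 | 27.80 ± 0.31 |
| WN | 7.40 ± 0.21 | 29.86 ± 0.38 |
| PBWN-Riem (ours) | 6.23 ± 0.10 | 27.49 ± 0.35 |
| PBWN (ours) | 6.31 ± 0.11 | 27.33 ± 0.21 |
| PBWN-Epoch (ours) | 6.27 ± 0.11 | 26.91 ± 0.25 |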
4 Experiments
In this section, we first conduct extensive experiments for supervised learning on four widely-used image datasets, i.e., CIFAR-10, CIFAR-100, SVHN and ImageNet, and investigate the performance over various types of CNNs. We also consider semi-supervised learning tasks on the permutation invariant MNIST dataset by using the Ladder network [35]. For all experiments, we adopt random weight initialization by default as described in [24], unless we specify another weight initialization method.
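Table 3: Test errors (%) on CIFAR-10 with residual networks of different depths.

| Methods | Res-20 | Res-32 | Res-44 | Res-56 | Res-110 |
|---|---|---|---|---|---|
| BaseLine* | 8.75 | 7.51 | 7.17 | 6.97 | 6.61 ± 0.16 |
| BaseLine | 7.94 ± 0.16 | 7.70 ± 0.26 | 7.17 ± 0.25 | 7.21 ± 0.25 | 7.09 ± 0.24 |
| WN | 8.12 ± 0.18 | 7.25 ± 0.14 | 6.86 ± 0.06 | 7.01 ± 0.52 | 7.56 ± 1.11 |
| PBWN-Riem (ours) | 8.03 ± 0.17 | 7.18 ± 0.18 | 6.69 ± 0.15 | 6.42 ± 0.25 | 6.68 ± 0.31 |
| PBWN (ours) | 8.08 ± 0.07 | 7.09 ± 0.18 | 6.89 ± 0.17 | 6.48 ± 0.17 | 6.27 ± 0.34 |
| PBWN-Epoch (ours) | 7.86 ± 0.25 | 6.99 ± 0.27 | 6.59 ± 0.17 | 6.41 ± 0.13 | 6.39 ± 0.45 |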
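Table 4: Test errors (%) on CIFAR-100 with residual networks of different depths.

| Methods | Res-20 | Res-32 | Res-44 | Res-56 | Res-110 |
|---|---|---|---|---|---|
| BaseLine | 32.28 ± 0.16 | 30.62 ± 0.35 | 29.95 ± 0.66 | 29.07 ± 0.40 | 28.79 ± 0.63 |
| WN | 31.90 ± 0.45 | 30.63 ± 0.37 | 29.57 ± 0.29 | 29.16 ± 0.45 | 28.38 ± 0.99 |
| PBWN-Riem (ours) | 31.81 ± 0.28 | 30.12 ± 0.36 | 29.15 ± 0.18 | 28.13 ± 0.49 | 27.03 ± 0.33 |
| PBWN (ours) | 31.99 ± 0.14 | 30.21 ± 0.20 | 29.04 ± 0.43 | 28.23 ± 0.31 | 27.16 ± 0.57 |
| PBWN-Epoch (ours) | 31.61 ± 0.40 | 29.85 ± 0.17 | 28.83 ± 0.09 | 28.17 ± 0.24 | 27.15 ± 0.58 |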
4.1 The State-of-the-Art CNNs
In the following part, we evaluate our method on the CIFAR datasets (both CIFAR-10 and CIFAR-100) over state-of-the-art CNNs, including Inception [43], VGG [39] and residual networks [16, 49]. CIFAR-10 consists of 50,000 training images and 10,000 test images from 10 classes, while the images of CIFAR-100 come from 100 classes. Each input image consists of $32 \times 32$ pixels. The dataset was preprocessed as described in [16] by subtracting the mean and dividing by the variance for each channel. We follow the simple data augmentation described in [16]: 4 pixels are padded on each side, and a $32 \times 32$ crop is randomly sampled from the padded image or its horizontal flip.
We refer to the original networks as ‘Normal’. For our projection based weight normalization methods, we evaluate three setups as follows: (1) ‘PBWN-Riem’: performing norm projection in each iteration based on Riemannian gradients; (2) ‘PBWN’: performing norm projection in each iteration based on ordinary gradients; (3) ‘PBWN-Epoch’: performing norm projection once per epoch based on ordinary gradients. We also choose a closely related method, Weight Normalization [38] (referred to as ‘WN’), as one baseline.
4.1.1 Inception Architecture
We first evaluate our method on the Inception architecture [43] equipped with batch normalization (BN), inserted after each convolution layer. All the models are trained by SGD with a mini-batch size of 64, considering the memory constraints on one GPU. We adopt a momentum of 0.9 and weight decay of 0.0005. Regarding the learning rate annealing, we start with a learning rate of 0.1, divide it by 5 at 50, 80 and 100 epochs, and terminate the training at 120 epochs. The results are obtained by averaging over five random seeds. Figure 3 (a) and (b) show the training error with respect to epochs on the CIFAR-10 and CIFAR-100 datasets respectively, and Table 1 lists the test errors. From Figure 3, we observe that our model converges significantly faster than the baselines. In particular, ‘PBWN-Riem’ and ‘PBWN’ have nearly identical training curves, which means that there is no need to calculate the Riemannian gradient when performing norm projection in the Inception network with BN. The test performance in Table 1 further demonstrates that our methods also achieve significant improvements over the baselines, which we attribute mainly to their regularization ability.
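The learning rate annealing described above can be written as a small helper. This is a sketch only: the function name is hypothetical, and it assumes zero-based epoch indices with each drop taking effect at the start of the listed epoch, which the text does not specify:

```python
def step_lr(epoch, base_lr=0.1, milestones=(50, 80, 100), factor=5.0):
    """Learning rate for a given epoch: divide by `factor` at each milestone."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


# Learning rate for every epoch of a 120-epoch run.
schedule = [step_lr(e) for e in range(120)]
print(schedule[0], schedule[50], schedule[80], schedule[100])
```

The same helper covers the VGG setup by passing `milestones=(80, 120)` and iterating over 160 epochs.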
4.1.2 VGG Architecture
We further investigate the performance on the VGG-E architecture [39] with global average pooling and batch normalization inserted after each convolution layer. We initialize the model with He-Init [15]. The models are again trained by SGD with a mini-batch size of 128, a momentum of 0.9 and weight decay of 0.0005. Here, we start with a learning rate of 0.1, divide it by 5 at 80 and 120 epochs, and terminate the training at 160 epochs. The averaged test errors after training are shown in Table 2, from which we reach the same conclusion as for the Inception architecture: our model significantly improves on the test performance of the baselines.
4.1.3 Residual Network
In this experiment, we further apply our method to the well-known residual network architecture [16]. We follow exactly the same experimental protocol as described in [16] and adopt the publicly available Torch implementation (https://github.com/facebook/fb.resnet.torch) for the residual network. Tables 3 and 4 respectively show the results of the different methods on CIFAR-10 and CIFAR-100, using the residual network architecture with depths 20, 32, 44, 56 and 110. We find that our methods consistently achieve better performance across the different depths. In particular, as the depth increases, our methods obtain larger performance gains. Besides, we observe that there is no significant difference among the different norm projection methods, whether they use different gradient information or different updating intervals. Indeed, ‘PBWN-Epoch’ works best in most cases. This further indicates that executing norm projection at an interval is effective and efficient, without performance degradation.
4.1.4 Efficiency Analysis
We also investigate the wall-clock time of training the above networks, including Inception, VGG and the 110-layer residual network. The experiment is implemented in Torch and conducted on one Tesla K80 GPU. From the results reported in Table 5, we find that our ‘PBWN-Epoch’ costs almost the same time as ‘Normal’ on all architectures, which means that it introduces no extra time cost in practice, in line with the analysis in the previous sections. ‘PBWN’ also requires little extra time, while ‘PBWN-Riem’ needs non-negligible extra time. The results show that the norm projection solution improves the efficiency of optimization with unit-norm constraints while achieving satisfying performance.
4.2 Large-Scale Classification Task
SVHN dataset
To comprehensively study the performance of the proposed method, we consider a larger dataset, SVHN [30], for digit recognition. SVHN consists of $32 \times 32$ color images of house numbers collected by Google Street View. It includes 73,257 training images and 26,032 test images. Besides, we further append the extra 531,131 images to the training set. The experiment is based on the wide residual network, which achieves state-of-the-art results on this dataset. We use WRN-16-4 as [49] does, and follow the experimental setting provided in [49]: (1) the input images are divided by 255 to bring them into the [0,1] range; (2) during training, SGD is used with a momentum of 0.9, dampening of 0, weight decay of 0.0005 and a mini-batch size of 128. The initial learning rate is set to 0.01 and multiplied by 0.1 at 80 and 120 epochs, until the total of 160 epochs completes. Dropout is set to 0.4. Here, we only apply our method ‘PBWN-Epoch’ to this WRN-16-4 architecture; namely, we execute norm projection once per epoch, considering the time cost for such a large dataset. The results are shown in Table 6 in comparison with several state-of-the-art methods from the literature. It is easy to see that WRN achieves the best performance among the baselines, and that our method further improves WRN by simply executing the efficient norm projection operation once per epoch.
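Table 5: Wall-clock time of training the Inception, VGG and 110-layer residual networks.

| Methods | Inception | VGG | Res-110 |
|---|---|---|---|
| Normal | 20.96 | 4.20 | 5.96 |
| WN | 23.33 | 5.27 | 6.42 |
| PBWN-Riem | 23.92 | 5.01 | 7.49 |
| PBWN | 21.21 | 4.23 | 6.29 |
| PBWN-Epoch | 20.97 | 4.20 | 5.97 |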
Table 6: Test errors (%) on SVHN.

| Methods | Test error |
|---|---|
| DSN [25] | 1.92 |
| RSD [18] | 1.75 |
| GPF [25] | 1.69 |
| WRN [49] | 1.64 |
| WRN* | 1.644 (± 0.046) |
| WRN-PBWN-Epoch | 1.607 (± 0.005) |
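Table 7: Top-1 and Top-5 errors (%) on ImageNet 2012 with the 34-layer residual network (Residual) and its pre-activation version (PreResidual).

| Method | Residual Top-1 | Residual Top-5 | PreResidual Top-1 | PreResidual Top-5 |
|---|---|---|---|---|
| Normal | 28.62 | 9.69 | 28.81 | 9.78 |
| PBWN-Epoch | 27.88 | 9.23 | 28.2 | 9.45 |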
Table 8: Test errors (%) on permutation invariant MNIST for a given number of labeled samples.

| Method | 20 | 50 | 100 |
|---|---|---|---|
| CatGAN [40] | - | - | 1.91 ± 0.1 |
| Skip Deep Generative Model [27] | - | - | 1.32 ± 0.07 |
| Auxiliary Deep Generative Model [27] | - | - | 0.96 ± 0.02 |
| Virtual Adversarial [28] | - | - | 1.36 |
| Ladder [35] | - | 1.62 ± 0.65 | 1.06 ± 0.37 |
| Ladder+AMLP [34] | - | - | 1.002 ± 0.038 |
| GAN with feature matching [37] | 16.77 ± 4.52 | 2.21 ± 1.36 | 0.93 ± 0.065 |
| Triple-GAN [26] | 4.81 ± 4.95 | 1.56 ± 0.72 | 0.91 ± 0.58 |
| Ladder* (our implementation) | 9.67 ± 10.1 | 3.53 ± 6.6 | 1.12 ± 0.59 |
| Ladder+PBWN (ours) | 2.52 ± 2.42 | 1.06 ± 0.48 | 0.91 ± 0.05 |
ImageNet 2012
To further validate the effectiveness of our method on a large-scale dataset, we employ ImageNet 2012, which consists of 1,000 classes [8]. We train the models on the official 1.28M training images, and evaluate on the validation set with 50k images. We evaluate the classification performance based on top-1 and top-5 error. Note that in this part, we mainly focus on whether our proposed method is able to handle diverse and large-scale datasets and provide a relative benefit for conventional architectures, rather than on achieving state-of-the-art results. We use the 34-layer residual network [16] and its pre-activation version [17] to perform the classification task. Stochastic gradient descent is again applied with a mini-batch size of 64, a momentum of 0.9 and a weight decay of 0.0001. We decay the learning rate exponentially to a fixed fraction of its initial value by the end of the 50 training epochs. We run with different initial learning rates and select the best results, shown in Table 7. We find that ‘PBWN-Epoch’ achieves lower test errors than the original residual network and the pre-activation residual network.
4.3 Semi-supervised Learning for Permutation Invariant MNIST
In this section, we apply our proposed method to semi-supervised learning tasks with the Ladder network [35] on the permutation invariant MNIST dataset. Three semi-supervised classification tasks are considered, with 20, 50 and 100 labeled examples respectively. These labeled examples are sampled randomly with a balanced number for each class.
We re-implement the Ladder network in Torch, following the Theano implementation by [35]. Specifically, we adopt the setup described in [35] and [34]: (1) the layer sizes of the model are 784-1000-500-250-250-250-10; (2) the models are trained by Adam optimization [21] with a mini-batch size of 100 (the task with 100 labeled examples), 50 (the task with 50 labeled examples) and 20 (the task with 20 labeled examples); (3) all the models are trained for 50,000 iterations with the initial learning rate, followed by 25,000 iterations with the learning rate decaying linearly to 0. We execute a simple hyper-parameter search over the learning rate and the weight decay (the detailed experimental configuration to reproduce our results is in our code, available at https://github.com/huangleiBuaa/NormProjection). In this case, all experiments are run with 10 random seeds.
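The two-phase learning rate schedule in item (3) can be sketched as follows. The function name is hypothetical, the base learning rate is left as a parameter because the searched values are not given here, and zero-based iteration indices are assumed:

```python
def ladder_lr(it, base_lr, constant_iters=50_000, decay_iters=25_000):
    """Constant learning rate, then linear decay to 0 over `decay_iters`."""
    if it < constant_iters:
        return base_lr
    remaining = constant_iters + decay_iters - it
    return base_lr * max(remaining, 0) / decay_iters


# Relative learning rate (base_lr = 1.0) at a few points of the 75,000 iterations.
checkpoints = [0, 49_999, 50_000, 62_500, 75_000]
relative = [ladder_lr(it, base_lr=1.0) for it in checkpoints]
print(relative)
```

Using `base_lr=1.0` gives the multiplier applied to whichever initial learning rate the hyper-parameter search selects.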
In Table 8, we report the results of Ladder based on our implementation (denoted by Ladder*) and of our ‘PBWN’, which performs norm projection in each iteration. From Table 8, we can see that our method significantly improves the performance of the original Ladder network and achieves new state-of-the-art results in the tasks with 20, 50, and 100 labeled examples. In particular, with 20 labeled examples our method achieves a 2.52% test error. We conjecture that these results stem mainly from the good regularization ability of our method.
5 Related Work and Discussion
There exist a number of methods that regularize neural networks by bounding the magnitude of weights. One commonly used method is weight decay [23], which can be considered as a solution to the loss function appended with a regularization term given by the squared L2-norm of the weight vector. Max-norm [41, 42] constrains the norm of the incoming weights at each hidden unit to be bounded by a constant. It can be viewed as a constrained optimization problem over a ball in the parameter space, while our method addresses the optimization problem over an Oblique manifold. Path normalization [31] follows the idea of max-norm, but bounds the product of weights along a path from the input to the output nodes, which can also be viewed as a regularizer like weight decay [23]. Weight normalization [38] decouples the length of each incoming weight vector from its direction. If the extra scaling parameter is not considered, weight normalization can be viewed as normalizing the incoming weight. However, it solves the problem via re-parameterization and cannot guarantee that the conditioning of the Hessian matrix over the proxy parameters will be improved, while our method performs normalization via projection and optimization over the original parameter space, which ensures the improvement of the conditioning of the Hessian matrix as shown in Figure 1. We experimentally show that our method outperforms weight normalization [38] in terms of both effectiveness and computational efficiency.
There is a large amount of work introducing orthogonality to the weight matrix [4, 47, 9, 45, 14, 32, 19] in deep neural networks to address the gradient vanishing and explosion problem. Solving the problem with such an orthogonality constraint is usually limited to the hidden-to-hidden transformation in recurrent neural networks [4, 47, 9, 45]. Some work also considers orthogonal weight matrices in feed-forward neural networks [14, 32, 19], but these solutions introduce expensive computation costs.
Normalizing the activations [20, 5, 36] in deep neural networks has also been studied. Batch normalization [20] is a well-known and effective technique to normalize the activations. It standardizes the pre-activation of each neuron to zero mean and unit variance over each mini-batch. Layer normalization [5] computes the mean and variance statistics over all the hidden units in the same layer, targeting the scenario where the size of the mini-batch is limited. Divisive normalization [36] is proposed from a unified view of normalization, which includes batch and layer normalization as special cases. These methods focus on normalizing the activations and are data-dependent normalization, while our method normalizes the weights and is therefore data-independent normalization. Since our method is orthogonal to these methods, we provide analysis and experimental results showing that our method can improve the performance of batch normalization when the two are combined.
Concurrent to our work, Cho and Lee [7] propose to optimize over the Grassmann manifold, aiming to improve the performance of neural networks equipped with batch normalization [20]. Their work differs from ours in two aspects: (1) they only use the traditional Riemannian optimization method (‘Riemannian gradient + exponential maps’ [2]) to solve the constrained optimization problem, which introduces non-trivial computation cost, while we consider both the Riemannian optimization method (‘Riemannian gradient + retraction’ [1]) and further propose a more general and efficient projection based weight normalization framework, which introduces negligible extra computation cost; (2) [7] requires the gradient clipping technique [33] to make optimization stable and also needs a tailored revision of SGD with momentum. In contrast, our method is more general, requires no extra tailored revision, and collaborates well with other techniques for training neural networks.
6 Conclusions
The scaling-based weight space symmetry can cause an ill-conditioning problem when optimizing deep neural networks. In this paper, we propose to address the problem by constraining the incoming weights of each neuron to be unit-norm. We provide the projection based weight normalization method, which serves as a simple yet effective and efficient solution to this constrained optimization problem. Our extensive experiments demonstrate that the proposed method consistently improves the performance of various state-of-the-art network architectures on large-scale datasets. These results suggest that projection based weight normalization is a promising direction for improving the performance of deep neural networks by alleviating the ill-conditioning problem.
References
 [1] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In ICASSP, 2006.
 [2] P.-A. Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Appl. Math., 80(2):199–220, 2004.
 [3] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.
 [4] Martín Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In ICML, 2016.
 [5] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
 [6] An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993.
 [7] Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. CoRR, abs/1709.09603, 2017.
 [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
 [9] Victor Dorobantu, Per Andre Stromhaug, and Jess Renteria. Dizzyrnn: Reparameterizing recurrent neural networks for norm-preserving backpropagation. CoRR, abs/1612.04035, 2016.
 [10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 [11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 [12] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Maxout networks. In ICML, 2013.
 [13] Roger B. Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse fisher matrix. In ICML, 2015.
 [14] Mehrtash Harandi and Basura Fernando. Generalized backpropagation, etude de cas: Orthogonality. In arxiv, 2017.
 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
 [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
 [18] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, pages 646–661, 2016.
 [19] Lei Huang, Xianglong Liu, Bo Lang, Adams Wei Yu, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. CoRR, abs/1709.06079, 2017.
 [20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [23] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NIPS, 1992.
 [24] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 1998.
 [25] Chen-Yu Lee, Saining Xie, Patrick W. Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, volume 38 of JMLR Proceedings. JMLR.org, 2015.
 [26] Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. CoRR, abs/1703.02291, 2017.
 [27] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. In ICML, 2016.
 [28] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. CoRR, abs/1704.03976, 2017.
 [29] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
 [30] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.
 [31] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In NIPS, 2015.
 [32] Mete Ozay and Takayuki Okatani. Optimization on submanifolds of convolution kernels in cnns. CoRR, abs/1610.07008, 2016.
 [33] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
 [34] Mohammad Pezeshki, Linxi Fan, Philemon Brakel, Aaron C. Courville, and Yoshua Bengio. Deconstructing the ladder network architecture. In ICML, 2016.
 [35] Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
 [36] Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H. Sinz, and Richard S. Zemel. Normalizing the normalizers: Comparing and extending network normalization schemes. In ICLR, 2017.
 [37] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, pages 2226–2234, 2016.
 [38] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
 [39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [40] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, 2016.
 [41] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In COLT, 2005.
 [42] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
 [43] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [44] Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. Neural machine translation with reconstruction. In AAAI, 2017.
 [45] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In ICML, 2017.
 [46] Simon Wiesler, Alexander Richard, Ralf Schlüter, and Hermann Ney. Mean-normalized stochastic gradient for large-scale deep learning. In ICASSP, 2014.
 [47] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In NIPS, pages 4880–4888, 2016.
 [48] Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime G. Carbonell. Normalized gradient with adaptive stepsize method for deep neural network training. CoRR, abs/1707.04822, 2017.
 [49] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.