Can We Gain More from Orthogonality Regularizations in Training Deep CNNs?
Abstract
This paper seeks to answer the question: as the (near-)orthogonality of weights is found to be a favorable property for training deep convolutional neural networks, how can we enforce it in more effective and easy-to-use ways? We develop novel orthogonality regularizations for training deep CNNs, utilizing advanced analytical tools such as mutual coherence and the restricted isometry property. These plug-and-play regularizations can be conveniently incorporated into training almost any CNN without extra hassle. We then benchmark their effects on state-of-the-art models (ResNet, WideResNet, and ResNeXt) on several popular computer vision datasets: CIFAR-10, CIFAR-100, SVHN and ImageNet. We observe consistent performance gains after applying the proposed regularizations, in terms of both the final accuracies achieved and faster, more stable convergence. We have made our code and pre-trained models publicly available.
1 Introduction
Despite the tremendous success of deep convolutional neural networks (CNNs) krizhevsky2012imagenet (), their training remains notoriously difficult both theoretically and practically, especially for state-of-the-art ultra-deep CNNs. The potential reasons for this difficulty are manifold, ranging from vanishing/exploding gradients glorot2010understanding (), to feature statistic shifts ioffe2015batch (), to the proliferation of saddle points dauphin2014identifying (), and so on. Various solutions have been proposed to alleviate these issues, examples of which include parameter initialization saxe2013exact (), residual connections he2016deep (), normalization of internal activations ioffe2015batch (), and second-order optimization algorithms dauphin2014identifying ().
This paper focuses on one type of structural regularization: orthogonality, to be imposed on linear transformations between hidden layers of CNNs. Orthogonality implies energy preservation, which is extensively explored for filter banks in signal processing and guarantees that the energy of activations will not be amplified zhou2006special (). Therefore, it can stabilize the distribution of activations over layers within CNNs rodriguez2016regularizing (); desjardins2015natural () and make optimization more efficient. saxe2013exact () advocates orthogonal initialization of weight matrices, and theoretically analyzes its effects on learning efficiency using deep linear networks. Practical results on image classification using orthogonal initialization are also presented in mishkin2015all (). More recently, a few works jia2016improving (); harandi2016generalized (); ozay2016optimization (); xie2017all (); huang2017orthogonal () look at (various forms of) enforcing orthogonality regularizations or constraints throughout training, as part of their specialized models for applications such as classification xie2017all () or person re-identification sun2017svdnet (). They observed encouraging improvements. However, a dedicated and thorough examination of the effects of orthogonality for training state-of-the-art general CNNs has been absent so far.
Even more importantly, how to evaluate and enforce orthogonality for non-square weight matrices does not have a single optimal answer. As we will explain later, existing works employ the most obvious but not necessarily appropriate option. We will introduce a series of more sophisticated regularizers that lead to larger performance gains.
This paper investigates and pushes forward various ways to enforce orthogonality regularizations in training deep CNNs. Specifically, we introduce three novel regularization forms for orthogonality, ranging from the double-sided variant of the standard Frobenius norm-based regularizer, to utilizing Mutual Coherence (MC) and Restricted Isometry Property (RIP) tools candes2005decoding (); donoho2006compressed (); wang2016sparse (). Those orthogonality regularizations have a plug-and-play nature, i.e., they can be incorporated into training almost any CNN without hassle. We extensively evaluate the proposed orthogonality regularizations on three state-of-the-art CNNs: ResNet he2016deep (), ResNeXt xie2017aggregated (), and WideResNet zagoruyko2016wide (). In all experiments, we observe consistent and remarkable accuracy boosts (e.g., 2.31% in CIFAR-100 top-1 accuracy for WideResNet), as well as faster and more stable convergence, without any other change made to the original models. This implies that many deep CNNs may not yet have been trained to their full potential, and that orthogonality regularizations can help. Our experiments further reveal that larger performance gains can be attained by designing stronger forms of orthogonality regularizations. We find the RIP-based regularizer, which has better analytical grounds to characterize near-orthogonal systems zhang2011sparse (), to consistently outperform existing Frobenius norm-based regularizers and others.
2 Related Work
To remedy unstable gradient and covariate shift problems, glorot2010understanding (); he2015delving () advocated near-constant variances of each layer’s output at initialization. ioffe2015batch () presented a major breakthrough in stabilizing training, by ensuring that each layer’s outputs follow identical distributions, which reduces the internal covariate shift. salimans2016weight () further decoupled the norm of the weight vector from its phase (direction) while introducing independence between minibatch examples, resulting in a better optimization problem. Orthogonal weights have been widely explored in Recurrent Neural Networks (RNNs) pascanu2013difficulty (); dorobantu2016dizzyrnn (); arjovsky2016unitary (); mhammedi2016efficient (); vorontsov2017orthogonality (); wisdom2016full () to help avoid gradient vanishing/explosion. pascanu2013difficulty () proposed a soft constraint technique to combat vanishing gradients, by forcing the Jacobian matrices to preserve energy measured by the Frobenius norm. The more recent study vorontsov2017orthogonality () investigated the effect of soft versus hard orthogonal constraints on the performance of RNNs, the former by specifying an allowable range for the maximum singular value of the transition matrix, thus allowing it to deviate within a small interval around one.
In CNNs, orthogonal weights are also recognized to stabilize the layer-wise distribution of activations rodriguez2016regularizing () and make optimization more efficient. saxe2013exact (); mishkin2015all () presented the idea of orthogonal weight initialization in CNNs, which is driven by the norm-preserving property of orthogonal matrices: an outcome similar to what batch normalization (BN) tries to achieve. saxe2013exact () analyzed the nonlinear dynamics of CNN training. Under simplified assumptions, they concluded that random orthogonal initialization of weights will give rise to the same convergence rate as unsupervised pre-training, and will be superior to random Gaussian initialization. However, a good initial condition such as orthogonality is not necessarily sustained throughout training. In fact, the weight orthogonality and isometry will break down easily once training starts, if not properly regularized saxe2013exact (). Several recent works harandi2016generalized (); ozay2016optimization (); huang2017orthogonal () considered Stiefel manifold-based hard constraints on weights. harandi2016generalized () proposed a Stiefel layer to guarantee fully connected layers to be orthogonal by using Riemannian gradients, without considering similar handling for convolutional layers; their performance reported on VGG networks simonyan2014very () was less than promising. ozay2016optimization () extended Riemannian optimization to convolutional layers and required filters within the same channel to be orthogonal. To overcome the challenge that CNN weights are usually rectangular rather than square matrices, huang2017orthogonal () generalized the Stiefel manifold property and formulated an Optimization over Multiple Dependent Stiefel Manifolds (OMDSM) problem. Different from ozay2016optimization (), it ensured filters across channels to be orthogonal.
A related work jia2016improving () adopted a Singular Value Bounding (SVB) method, explicitly thresholding the singular values of weight matrices to lie within a pre-specified narrow band around the value of one.
The above methods jia2016improving (); harandi2016generalized (); ozay2016optimization (); huang2017orthogonal () all fall in the category of enforcing “hard orthogonality constraints” in optimization (jia2016improving () could be viewed as a relaxed constraint), and have to repeat singular value decomposition (SVD) during training. SVD on high-dimensional matrices is expensive even on GPUs, which is one reason why we choose not to go in the “hard constraint” direction in this paper. Moreover, since CNN weight matrices cannot exactly lie on a Stiefel manifold as they are either very “thin” or “fat” (e.g., $W^\top W = I$ may never happen for an overcomplete “fat” $W$ due to rank deficiency of its Gram matrix), special treatments are needed to maintain the hard constraint. For example, huang2017orthogonal () proposed group-based orthogonalization to first divide an overcomplete weight matrix into “thin” column-wise groups, and then apply Stiefel manifold constraints group-wise. The strategy was also motivated by reducing the computational burden of computing large-scale SVDs. Lately, balestriero2018spline (); balestriero2018mad () interpreted CNNs as Template Matching Machines, and proposed a penalty term to force the templates to be orthogonal to each other, leading to significantly improved classification performance and reduced overfitting with no change to the deep architecture.
A recent work xie2017all () explored orthogonal regularization, by enforcing the Gram matrix of each weight matrix to be close to identity under the Frobenius norm. It constrains orthogonality among filters in one layer, leading to smaller correlations among learned features and implicitly reducing the filter redundancy. Such a soft orthonormal regularizer is differentiable and requires no SVD, thus being computationally cheaper than its “hard constraint” siblings. However, we will see later that Frobenius norm-based orthogonality regularization is only a rough approximation, and is inaccurate for “fat” matrices as well. The authors relied on a backward error modulation step, as well as a group-wise orthogonalization similar to that in huang2017orthogonal (). We also notice that xie2017all () displayed the strong advantage of enforcing orthogonality in training the authors’ self-designed plain deep CNNs (i.e., without residual connections). However, they found smaller performance impacts when applying the same to training prevalent network architectures such as ResNet he2016deep (). In comparison, our orthogonality regularizations can be added to CNNs as “plug-and-play” components, without any other modification needed. We observe evident improvements brought by them on the most popular ResNet architectures.
Finally, we briefly outline a few works related to orthogonality in more general senses. One may notice that enforcing a matrix to be (near-)orthogonal during training will lead to its spectral norm being always equal (or close) to one, which links regularizing orthogonality to regularizing the spectrum. In wang2016analysis (), the authors showed that the spectrum of the Extended Data Jacobian Matrix (EDJM) affected the network performance, and proposed a spectral soft regularizer that encourages major singular values of the EDJM to be closer to the largest one. keskar2016large () claimed that the maximum eigenvalue of the Hessian predicted the generalizability of CNNs. Motivated by that, yoshida2017spectral () penalized the spectral norm of weight matrices in CNNs. A similar idea was later extended in miyato2018spectral () for training generative adversarial networks, by proposing a spectral normalization technique to normalize the spectral norm/Lipschitz norm of the weight matrix to be one.
3 Deriving New Orthogonality Regularizations
In this section, we will derive and discuss several orthogonality regularizers. Note that those regularizers are applicable to both fully-connected and convolutional layers. The default mathematical expressions of regularizers will be assumed on a fully-connected layer $W \in \mathbb{R}^{m \times n}$ ($m$ could be either larger or smaller than $n$). For a convolutional layer $C \in \mathbb{R}^{S \times H \times C \times M}$, where $S$, $H$, $C$, $M$ are filter width, filter height, input channel number and output channel number, respectively, we will first reshape $C$ into a matrix form $W' \in \mathbb{R}^{m' \times n'}$, where $m' = S \times H \times C$ and $n' = M$. The setting for regularizing convolutional layers follows xie2017all (); huang2017orthogonal () to enforce orthogonality across filters, encouraging filter diversity. All our regularizations are directly amenable to almost any CNN: no change is needed to the network architecture, nor to any other training protocol (unless otherwise specified).
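The reshaping of a convolutional kernel into a matrix can be sketched as follows. This is a minimal illustration, assuming the kernel is stored as a NumPy array with axes ordered (filter width, filter height, input channels, output channels); deep learning frameworks often use a different axis order, in which case a transpose is needed first. The function name `conv_kernel_to_matrix` is hypothetical.

```python
import numpy as np

def conv_kernel_to_matrix(kernel):
    """Flatten a kernel of shape (S, H, C, M) into a matrix of shape (S*H*C, M).

    Assumed layout: the last axis indexes output channels, so each column of
    the result is one flattened filter.
    """
    s, h, c, m = kernel.shape
    return kernel.reshape(s * h * c, m)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3, 16, 32))  # 3x3 filters, 16 inputs, 32 outputs
w = conv_kernel_to_matrix(kernel)
print(w.shape)  # (144, 32): a tall, undercomplete matrix
```

With this layout, orthogonality between columns of the matrix corresponds to orthogonality between the flattened filters of different output channels.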
3.1 Baseline: Soft Orthogonality Regularization
Previous works xie2017all (); balestriero2018spline (); balestriero2018mad () proposed to require the Gram matrix of the weight matrix to be close to identity, which we term as Soft Orthogonality (SO) regularization:
$$\lambda \|W^\top W - I\|_F^2 \tag{1}$$
where $\lambda$ is the regularization coefficient (the same hereinafter). It is a straightforward relaxation from the “hard orthogonality” assumption harandi2016generalized (); ozay2016optimization (); huang2017orthogonal (); wang2018learning () under the standard Frobenius norm, and can be viewed as a different weight decay term limiting the set of parameters close to a Stiefel manifold rather than inside a hypersphere. The gradient is given in an explicit form: $4\lambda W(W^\top W - I)$, and can be directly appended to the original gradient w.r.t. the current weight $W$.
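As a sanity check on the explicit gradient, the sketch below implements the SO penalty $\lambda \|W^\top W - I\|_F^2$ together with the closed form $4\lambda W(W^\top W - I)$ that follows from the chain rule, and compares it against central finite differences. The function names, the matrix size and the value of `lam` are illustrative assumptions, not settings from the experiments.

```python
import numpy as np

def so_penalty(w, lam):
    """Soft orthogonality penalty: lam * ||W^T W - I||_F^2."""
    residual = w.T @ w - np.eye(w.shape[1])
    return lam * np.sum(residual ** 2)

def so_gradient(w, lam):
    """Closed-form gradient of so_penalty w.r.t. W: 4 * lam * W (W^T W - I)."""
    residual = w.T @ w - np.eye(w.shape[1])
    return 4.0 * lam * (w @ residual)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, used only to check the closed form."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = eps
        grad[idx] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
w = rng.standard_normal((6, 4))
lam = 0.1  # illustrative value only
analytic = so_gradient(w, lam)
numeric = numerical_gradient(lambda x: so_penalty(x, lam), w)
print(np.max(np.abs(analytic - numeric)))  # small: the two gradients agree
```

In a training loop, `so_gradient` would simply be added to the task-loss gradient of each regularized layer.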
However, SO (1) is flawed for an obvious reason: the columns of $W$ can be mutually orthogonal if and only if $W$ is undercomplete ($m \ge n$). For overcomplete $W$ ($m < n$), its Gram matrix $W^\top W \in \mathbb{R}^{n \times n}$ cannot be even close to identity, because its rank is at most $m$, making $\|W^\top W - I\|_F^2$ a biased minimization objective. In practice, both cases can be found among layer-wise weight dimensions. The authors of huang2017orthogonal (); xie2017all () advocated further dividing an overcomplete $W$ into undercomplete column groups to resolve the rank deficiency trap. In this paper, we choose to simply use the original SO version (1) as a fair comparison baseline.
The authors of xie2017all () argued against the hybrid utilization of the original weight decay and the SO regularization. They suggested sticking to one type of regularization throughout training. Our experiments also find that applying both together throughout training hurts the final accuracy. Instead of simply discarding weight decay, we discover a scheme change approach that is validated to be most beneficial to performance; details can be found in Section 4.
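The rank argument can be made quantitative. The short derivation below is our own worked step rather than part of the original analysis; it only uses the fact that the Gram matrix of an $m \times n$ matrix $W$ is symmetric positive semidefinite with rank at most $m$.

```latex
% Let \gamma_1, \dots, \gamma_n \ge 0 be the eigenvalues of W^\top W.
% If m < n, at least n - m of them are zero because rank(W^\top W) <= m.
\|W^\top W - I\|_F^2
  \;=\; \sum_{i=1}^{n} (\gamma_i - 1)^2
  \;\ge\; \sum_{i \,:\, \gamma_i = 0} 1
  \;\ge\; n - m .
```

So for an overcomplete layer the SO penalty can never fall below $\lambda (n - m)$, however the weights are chosen.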
3.2 Double Soft Orthogonality Regularization
The double soft orthogonality regularization extends SO in the following form:
$$\lambda \left( \|W^\top W - I\|_F^2 + \|W W^\top - I\|_F^2 \right) \tag{2}$$
Note that an orthogonal $W$ will satisfy $W^\top W = W W^\top = I$; an overcomplete $W$ can be regularized to have small $\|W W^\top - I\|_F^2$ but will likely have large residual $\|W^\top W - I\|_F^2$, and vice versa for an undercomplete $W$. DSO is thus designed to cover both overcomplete and undercomplete cases; for either case, at least one term in (2) can be well suppressed, requiring either rows or columns of $W$ to stay orthogonal. It is a straightforward extension of SO.
Another similar alternative to DSO is “selective” soft orthogonality regularization, defined as: $\lambda \|W^\top W - I\|_F^2$ if $m > n$; $\lambda \|W W^\top - I\|_F^2$ if $m \le n$. Our experiments find that DSO always outperforms the selective regularization, so we only report DSO results.
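To illustrate the two terms of the double soft orthogonality penalty, the sketch below evaluates them on a tall matrix whose columns are exactly orthonormal. The helper names `so_term` and `dso_penalty` are hypothetical, and the 6-by-3 size is chosen only for illustration.

```python
import numpy as np

def so_term(a):
    """||A^T A - I||_F^2 for a matrix A."""
    residual = a.T @ a - np.eye(a.shape[1])
    return np.sum(residual ** 2)

def dso_penalty(w, lam):
    """Double soft orthogonality: lam * (||W^T W - I||_F^2 + ||W W^T - I||_F^2)."""
    return lam * (so_term(w) + so_term(w.T))

# A tall (undercomplete) 6x3 matrix with orthonormal columns, from a reduced QR.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((6, 3)))

print(so_term(q))    # column term: zero up to rounding
print(so_term(q.T))  # row term: equals m - n = 3 and cannot be reduced
```

Even at a perfectly column-orthonormal solution, the row term contributes a constant residual, so only one of the two terms can be driven to zero for a non-square matrix.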
3.3 Mutual Coherence Regularization
The mutual coherence donoho2006compressed () of $W$ is defined as:
$$\mu_W = \max_{i \neq j} \frac{|\langle w_i, w_j \rangle|}{\|w_i\| \cdot \|w_j\|} \tag{3}$$
where $w_i$ denotes the $i$-th column of $W$, $i = 1, 2, \dots, n$. The mutual coherence (3) takes values in $[0, 1]$, and measures the highest correlation between any two columns of $W$. In order for $W$ to have orthogonal or near-orthogonal columns, $\mu_W$ should be as low as possible (zero if $m \ge n$).
We wish to suppress $\mu_W$ as an alternative way to enforce orthogonality. Assuming $W$ has first been normalized to have unit-norm columns, $\langle w_i, w_j \rangle$ is essentially the $(i, j)$-th element of the Gram matrix $W^\top W$, and $i \neq j$ requires us to consider off-diagonal elements only. Therefore, we propose the following mutual coherence (MC) regularization term inspired by (3):
$$\lambda \|W^\top W - I\|_\infty \tag{4}$$
Although we do not explicitly normalize the column norms of $W$ to be one, we find experimentally that minimizing (4) often tends to implicitly encourage close-to-unit column norms too, making the objective of (4) a viable approximation of mutual coherence (3).
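The relation between the exact mutual coherence and its surrogate can be checked on a small example. The sketch below makes one assumption that should be stated loudly: it reads the infinity norm of $W^\top W - I$ as the largest absolute entry of that matrix, which matches the "maximum over column pairs" in the coherence definition; an induced matrix norm (maximum absolute row sum) would be a different reading. Function names are hypothetical.

```python
import numpy as np

def mutual_coherence(w):
    """Exact mutual coherence: largest |cosine| between two distinct columns."""
    normalized = w / np.linalg.norm(w, axis=0, keepdims=True)
    gram = np.abs(normalized.T @ normalized)
    np.fill_diagonal(gram, 0.0)  # ignore each column's correlation with itself
    return gram.max()

def mc_penalty(w, lam):
    """Surrogate penalty, assumed here to be the largest |entry| of W^T W - I."""
    residual = w.T @ w - np.eye(w.shape[1])
    return lam * np.max(np.abs(residual))

# Two unit-norm columns with inner product 0.6.
w = np.array([[1.0, 0.6],
              [0.0, 0.8]])
print(mutual_coherence(w), mc_penalty(w, 1.0))            # both equal 0.6 here
print(mutual_coherence(2.0 * w), mc_penalty(2.0 * w, 1.0))  # coherence unchanged, penalty grows
```

With unit-norm columns the two quantities coincide. Once the columns are rescaled, the coherence is unchanged while the surrogate also penalizes the diagonal deviation of the squared column norms from one, which is consistent with the observation that minimizing it tends to push column norms toward unity.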
The gradient of $\|W^\top W - I\|_\infty$ could be explicitly solved by applying a smoothing technique to the nonsmooth $\ell_\infty$ norm, e.g., lin2015optimized (). However, it will invoke an iterative routine each time to compute the $\ell_1$-ball proximal projection, which is less efficient in our scenario where massive gradient computations are needed. In view of that, we turn to auto-differentiation to approximately compute the gradient of (4) w.r.t. $W$.
3.4 Spectral Restricted Isometry Property Regularization
Recall that the RIP condition candes2005decoding () of $W$ assumes:
Assumption 1
For all vectors $z \in \mathbb{R}^n$ that are $k$-sparse, there exists a small $\delta_W \in (0, 1)$ s.t. $(1 - \delta_W) \le \frac{\|Wz\|^2}{\|z\|^2} \le (1 + \delta_W)$.
The above RIP condition essentially requires that every set of columns in $W$, with cardinality no larger than $k$, shall behave like an orthogonal system. Taking the extreme case $k = n$, RIP then turns into another criterion that enforces the entire $W$ to be close to orthogonal. Note that both mutual incoherence and RIP are well defined for both undercomplete and overcomplete matrices.
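We rewrite the special RIP condition with $k = n$ in the form below:
$$\left| \frac{\|Wz\|^2}{\|z\|^2} - 1 \right| \le \delta_W, \quad \forall z \in \mathbb{R}^n \tag{5}$$
Notice that $\sigma(W) = \sup_{z \in \mathbb{R}^n, z \neq 0} \frac{\|Wz\|}{\|z\|}$ is the spectral norm of $W$, i.e., the largest singular value of $W$. As a result, $\sigma(W^\top W - I) = \sup_{z \in \mathbb{R}^n, z \neq 0} \left| \frac{\|Wz\|^2}{\|z\|^2} - 1 \right|$. In order to enforce orthogonality on $W$ from an RIP perspective, one may wish to minimize the RIP constant $\delta_W$ in the special case $k = n$, which by definition should be chosen as $\sup_{z \in \mathbb{R}^n, z \neq 0} \left| \frac{\|Wz\|^2}{\|z\|^2} - 1 \right|$ according to (5). Therefore, we end up equivalently minimizing the spectral norm of $W^\top W - I$:
$$\lambda \cdot \sigma(W^\top W - I) \tag{6}$$
It is termed the Spectral Restricted Isometry Property (SRIP) regularization.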
The above reveals an interesting hidden link: regularizations with spectral norms were previously investigated in yoshida2017spectral (); miyato2018spectral (), through analyzing small perturbation robustness and the Lipschitz constant. The spectral norm re-arises from enforcing orthogonality when the RIP condition is adopted. But compared to the spectral norm (SN) regularization yoshida2017spectral (), which minimizes $\sigma(W)$, SRIP is instead enforced on $W^\top W - I$. Also, compared to miyato2018spectral (), which requires the spectral norm of $W$ to be exactly 1 (developed for GANs), SRIP requires all singular values of $W$ to be close to 1, which is essentially stricter because the resulting $W$ also needs to be well conditioned.
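The contrast with penalizing the spectral norm of the weight matrix itself can be made explicit through the singular value decomposition. The identity below is our own restatement under the notation of this section, with $\sigma_i(W)$ denoting the singular values of the $m \times n$ matrix $W$.

```latex
% SVD: W = U \Sigma V^\top, with singular values \sigma_1(W) \ge \dots \ge \sigma_n(W) \ge 0
% (padded with zeros when m < n), so that W^\top W - I = V (\Sigma^\top \Sigma - I) V^\top.
\sigma(W^\top W - I) \;=\; \max_{1 \le i \le n} \left| \sigma_i(W)^2 - 1 \right| .
% The SRIP objective is therefore small only if every singular value is close to 1,
% whereas \sigma(W) = \sigma_1(W) involves the largest singular value alone.
```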
We again resort to auto-differentiation to compute the gradient of (6) for simplicity. However, even computing the objective value of (6) can invoke the computationally expensive eigenvalue decomposition (EVD). To avoid that, we approximate the spectral norm using the power iteration method. Starting with a randomly initialized $v \in \mathbb{R}^n$, we iteratively perform the following procedure a small number of times (2 times by default):
$$u \leftarrow (W^\top W - I)\, v, \quad v \leftarrow (W^\top W - I)\, u, \quad \sigma(W^\top W - I) \leftarrow \frac{\|v\|}{\|u\|} \tag{7}$$
With such a rough approximation, SRIP reduces the computational cost from $O(n^3)$ to $O(mn^2)$, and is much faster in practice.
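The power iteration estimate can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the function name `srip_penalty` is hypothetical, and two details are our own additions for numerical safety, namely rescaling `v` after each round (which leaves the ratio unchanged) and a small constant in the denominators to avoid division by zero when the matrix is exactly orthogonal.

```python
import numpy as np

def srip_penalty(w, lam, num_iters=2, rng=None):
    """Estimate lam * sigma(W^T W - I) with a few rounds of power iteration."""
    rng = np.random.default_rng() if rng is None else rng
    a = w.T @ w - np.eye(w.shape[1])
    v = rng.standard_normal(w.shape[1])  # random start vector
    estimate = 0.0
    for _ in range(num_iters):
        u = a @ v
        v = a @ u
        estimate = np.linalg.norm(v) / (np.linalg.norm(u) + 1e-12)
        v = v / (np.linalg.norm(v) + 1e-12)  # rescale only; the ratio is unaffected
    return lam * estimate

# Toy check: W^T W - I = diag(3, 0, -0.75), whose spectral norm is 3.
w = np.diag([2.0, 1.0, 0.5])
exact = np.linalg.norm(w.T @ w - np.eye(3), 2)
approx = srip_penalty(w, lam=1.0, num_iters=2, rng=np.random.default_rng(0))
print(exact, approx)  # the estimate never exceeds the exact spectral norm
```

Because the ratio of norms is bounded above by the true spectral norm, a small number of rounds yields a lower-bound estimate that tightens as `num_iters` grows. In an auto-differentiation framework the same computation would be expressed with differentiable tensor operations so that the gradient w.r.t. the weights is obtained automatically.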
4 Experiments on Benchmarks
We base our experiments on several popular state-of-the-art models: ResNet he2016deep (); he2016identity () (including several different variants), Wide ResNet zagoruyko2016wide () and ResNeXt xie2017aggregated (). For fairness, all pre-processing, data augmentation and training/validation/testing splits are strictly identical to the original training protocols in zagoruyko2016wide (); he2016deep (); he2016identity (); xie2017aggregated (). All hyperparameters and architectural details remain unchanged too, unless otherwise specified.
We structure the experiment section in the following way. In the first part, we design a set of intensive experiments on CIFAR-10 and CIFAR-100, which each consist of 60,000 images of size 32×32 with a 5:1 training/testing split, divided into 10 and 100 classes respectively. We train each of the three models with each of the proposed regularizers, and compare their performance with the original versions, in terms of both final accuracy and convergence. In the second part, we further conduct experiments on the ImageNet and SVHN datasets. In both parts, we also compare our best performer SRIP with existing regularization methods with similar purposes.
Scheme Change for Regularization Coefficients
All the regularizers have an associated regularization coefficient, whose choice plays an important role in the regularized training process. Correspondingly, the original models use a separate regularization coefficient for weight decay. From experiments, we observe that fully replacing weight decay with orthogonal regularizers will accelerate and stabilize training at the beginning, but will negatively affect the final accuracies achievable. We conjecture that while the orthogonal parameter structure is most beneficial at the initial stage, it might be overly strict when training reaches the final “fine-tuning” stage, when we should allow more flexibility for the parameters. In view of that, we did extensive ablation experiments and identified a switching scheme between the two regularizations, at the beginning and late stages of training. Concretely, we gradually reduce the orthogonal regularization coefficient (initially 0.1 to 0.2) in three steps, after 20, 50 and 70 epochs, respectively, and finally set it to zero after 120 epochs. For the weight decay coefficient, we start with a fixed initial value; then for SO/DSO regularizers, we increase it after 20 epochs. For MC/SRIP regularizers, we find them insensitive to the choice of the weight decay coefficient, potentially due to their stronger effects in enforcing $W^\top W$ close to $I$; we thus stick to the initial value throughout training for them. Such an empirical “scheme change” design is found to work nicely with all models, benefiting both accuracy and efficiency. The above coefficient choices apply to all our experiments.
As pointed out by one anonymous reviewer, applying orthogonal regularization will change the optimization landscape, and its power seems to be a complex and dynamic story throughout training. In general, we find it to show a strong positive impact at the early stage of training (not just initialization), which concurs with previous observations. But such impact is observed to become increasingly negligible, and sometimes (slightly) negative, when the training approaches the end. That trend seems to be the same for all our regularizers.
4.1 Experiments on CIFAR-10 and CIFAR-100
We employ three model configurations on the CIFAR-10 and CIFAR-100 datasets:
ResNet-110 Model he2016deep ()
The 110-layer ResNet model he2016deep () is a very strong and popular ResNet version. It uses Bottleneck Residual Units, with the total depth p determined by the number n of convolutional blocks used. We use the Adam optimizer to train the model for 200 epochs, with the learning rate starting at 1e-2 and then decreased in three steps, after 80, 120 and 160 epochs, respectively.
Wide ResNet 28-10 Model zagoruyko2016wide ()
For the Wide ResNet model zagoruyko2016wide (), we use depth 28 and width 10, as this configuration gives the best accuracies for both CIFAR-10 and CIFAR-100, and is (relatively) computationally efficient. The model uses a Basic Block B(3,3), as defined in ResNet he2016deep (). We use the SGD optimizer with a Nesterov momentum of 0.9 to train the model for 200 epochs. The learning rate starts at 0.1, and is then decreased by a factor of 5 after 60, 120 and 160 epochs, respectively. We have followed all other settings of zagoruyko2016wide () identically.
ResNeXt 29-8-64 Model xie2017aggregated ()
For the ResNeXt model xie2017aggregated (), we consider the 29-layer architecture with a cardinality of 8 and a widening factor of 4, which reported the best state-of-the-art CIFAR-10/CIFAR-100 results compared to other contemporary models with similar amounts of trainable parameters. We use the SGD optimizer with a Nesterov momentum of 0.9 to train the model for 300 epochs. The learning rate starts from 0.1, and decays by a factor of 10 after 150 and 225 epochs, respectively.
Results
Table 1 compares the top-1 error rates in the three groups of experiments. To summarize, SRIP is the clear winner in almost all cases (the exception being ResNet-110 on CIFAR-100, where it is second best), with remarkable performance gains, such as a 2.31% top-1 error reduction on CIFAR-100 for Wide ResNet 28-10. SO acts as a surprisingly strong baseline and is often second only to SRIP. MC usually outperforms the original baseline but remains inferior to SRIP and SO. DSO seems the least effective among all four, and may perform even worse than the original baseline. We also carefully inspect the training curves (in terms of validation accuracies w.r.t. epoch numbers) of different methods on CIFAR-10 and CIFAR-100, with ResNet-110 curves shown in Fig. 1 as an example. With all models trained from scratch, we observe that all four regularizers significantly accelerate the training process in the initial training stage, and maintain higher accuracies throughout (most of) training, compared to the unregularized original version. The regularizers also stabilize training, in terms of fewer fluctuations in the training curves. We defer a more detailed analysis to Section 4.3.
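Table 1: Top-1 error rates (%) on CIFAR-10 and CIFAR-100.

| Model | Regularizer | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| ResNet-110 he2016deep () | None | 7.04* | 25.42* |
| | SO | 6.78 | 25.01 |
| | DSO | 7.04 | 25.83 |
| | MC | 6.97 | 25.43 |
| | SRIP | 6.55 | 25.14 |
| Wide ResNet 28-10 zagoruyko2016wide () | None | 4.16* | 20.50* |
| | SO | 3.76 | 18.56 |
| | DSO | 3.86 | 18.21 |
| | MC | 3.68 | 18.90 |
| | SRIP | 3.60 | 18.19 |
| ResNeXt 29-8-64 xie2017aggregated () | None | 3.70* | 18.53* |
| | SO | 3.58 | 17.59 |
| | DSO | 3.85 | 19.78 |
| | MC | 3.65 | 17.62 |
| | SRIP | 3.48 | 16.99 |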
Besides, we validate the benefit of scheme change. For example, we train Wide ResNet 28-10 with SRIP, but without scheme change (all else remains the same). We witness an increase in top-1 error on both CIFAR-10 and CIFAR-100, although the model still outperforms the original unregularized one. Other regularizers perform even worse without scheme change.
Comparison with Spectral Regularization
We compare SRIP with the spectral regularization (SR) developed in yoshida2017spectral (), which penalizes the squared spectral norm of each weight matrix, using the authors’ default coefficient. All other settings in yoshida2017spectral () have been followed identically. We apply the SR regularization to training the Wide ResNet 28-10 model and the ResNeXt 29-8-64 model. For the former, we obtain a top-1 error rate of 3.93% on CIFAR-10, and 19.08% on CIFAR-100. For the latter, the top-1 error rate is 3.54% on CIFAR-10, and 17.27% on CIFAR-100. Both are inferior to the SRIP results under the same settings in Table 1.
Comparison with Optimization over Multiple Dependent Stiefel Manifolds (OMDSM)
We also compare SRIP with OMDSM developed in huang2017orthogonal (), which makes for a fair comparison between soft regularization forms and hard constraint forms of enforcing orthogonality. That work trained Wide ResNet 28-10 on CIFAR-10 and CIFAR-100 and obtained error rates of 3.73% and 18.76% respectively, both being inferior to SRIP (3.60% for CIFAR-10 and 18.19% for CIFAR-100).
Comparison with Jacobian Norm Regularization
A recent work sokolic2017robust () proposes using the norm of the CNN Jacobian as a training regularizer. The paper used a variant of Wide ResNet zagoruyko2016wide () with 22 layers of width 5, whose original top-1 error rate was 6.66% on CIFAR-10, and reported a reduced error rate of 5.68% with their proposed regularizer. We trained this same model using SRIP over the same augmented full training set, achieving 4.28% top-1 error, which shows a large gain over the Jacobian norm-based regularizer.
4.2 Experiments on ImageNet and SVHN
We extend the experiments to two larger and more complicated datasets: ImageNet and SVHN (Street View House Numbers). Since SRIP clearly performs the best in the above experiments, among the proposed four, we will focus on comparing SRIP only.
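Table 2: Top-5 error rates (%) on ImageNet.

| Model | Regularizer | ImageNet |
|---|---|---|
| ResNet 34 he2016deep () | None | 9.84 |
| | OMDSM huang2017orthogonal () | 9.68 |
| | SRIP | 8.32 |
| PreResNet 34 he2016identity () | None | 9.79 |
| | OMDSM huang2017orthogonal () | 9.45 |
| | SRIP | 8.79 |
| ResNet 50 he2016deep () | None | 7.02 |
| | SRIP | 6.87 |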
Experiments on ImageNet
We train ResNet 34, PreResNet 34 and ResNet 50 he2016identity () on the ImageNet dataset with and without the SRIP regularizer, respectively. The training hyperparameter settings are consistent with the original models. The initial learning rate is set to 0.1, and decreases at epochs 30, 60, 90 and 120 by a factor of 10. The top-5 error rates are then reported on the ILSVRC-2012 val set, with a single model and single crop. huang2017orthogonal () also reported their top-5 error rates with both ResNet 34 and PreResNet 34 on ImageNet. As seen in Table 2, SRIP clearly performs the best for all three models.
Experiments on SVHN
On the SVHN dataset, we train the original Wide ResNet 16-8 model, following its original implementation in zagoruyko2016wide () with an initial learning rate of 0.01, which decays at epochs 60, 120 and 160, each time by a factor of 5. We then train the SRIP-regularized version with no change made other than adding the regularizer. While the original Wide ResNet 16-8 gives rise to an error rate of 1.63%, SRIP reduces it to 1.56%.
4.3 Summary, Remarks and Insights
From our extensive experiments with state-of-the-art models on popular benchmarks, we can conclude the following points:

In response to the question in our title: Yes, we can gain a lot from simply adding orthogonality regularizations into training. The gains can be found in both final achievable accuracy and empirical convergence.
For the former, the three models have obtained (at most) 0.49%, 0.56%, and 0.22% top-1 accuracy gains on CIFAR-10, and 0.41%, 2.31%, and 1.54% on CIFAR-100, respectively. For the latter, positive impacts are widely observed in our training and validation curves (Figure 1 as a representative example), in particular faster and smoother curves at the initial stage. Note that those improvements are obtained with no other changes made, and extend to datasets such as ImageNet and SVHN.

With its nice theoretical grounds, SRIP is also the best practical option among all four proposed regularizations. It consistently performs the best in achieving the highest accuracy as well as accelerating/stabilizing training curves. It also outperforms other recent methods utilizing spectral norm yoshida2017spectral (), hard orthogonality huang2017orthogonal (), and Jacobian norm sokolic2017robust ().

Despite its simplicity (and potential estimation bias), SO is a surprisingly robust baseline and frequently ranks second among all four. We conjecture that SO benefits from its smooth form and continuous gradient, which facilitates gradient-based optimization, while both SRIP and MC have to deal with nonsmooth problems.

DSO does not seem to be helpful. It often performs worse than SO, and sometimes even worse than the unregularized original model. We interpret this by recalling how the matrix $W$ is constructed (beginning of Section 3): enforcing $W^\top W$ close to $I$ has “inter-channel” effects (i.e., requiring different output channels to have orthogonal filter groups); whereas enforcing $W W^\top$ close to $I$ enforces “intra-channel” orthogonality (i.e., same spatial locations across different filter groups have to be orthogonal). The former is a better accepted idea. Our results on DSO seem to provide further evidence (from the counter side) that orthogonality should be primarily considered “inter-channel”, i.e., between columns of $W$.

MC brings certain improvements, but not as significant as those of SRIP. We notice that (4) will approximate (3) well only when $W$ has unit-norm columns. While we find that minimizing (4) generally has the empirical effect of approximately normalizing columns, this is not exactly enforced all the time. As we observed from experiments, large deviations of column-wise norms could occur at some point of training and potentially bring negative impacts. We plan to look for a reparameterization of $W$ to ensure unit norms throughout training, e.g., through integrating MC with weight normalization salimans2016weight (), in future work.

In contrast to many SVD-based hard orthogonality approaches, our proposed regularizers are lightweight and incur negligible extra training cost. Our experiments show that the per-iteration (batch) running time remains almost unchanged with or without our regularizers. Additionally, the improvements brought by regularization prove to be stable and reproducible. For example, we trained Wide ResNet 28-10 with SRIP from three different random initializations (all other protocols unchanged), and found that the final accuracies are very stable (deviation smaller than ), with best accuracy .
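One way a spectral-norm penalty can avoid a full SVD is power iteration, which needs only matrix-vector products. The sketch below assumes that SRIP penalizes the spectral norm of $W^\top W - I$. The function name, the iteration count, and the random start vector are illustrative assumptions and do not reproduce the authors' code.

```python
import numpy as np

def gram_deviation_spectral_norm(W, num_iters=100, seed=0):
    """Estimate the spectral norm of (W^T W - I) by power iteration.

    Only products with W and W^T are used, so the Gram matrix is never
    formed explicitly and no SVD is computed.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    sigma = 0.0
    for _ in range(num_iters):
        u = W.T @ (W @ v) - v      # apply (W^T W - I) to the unit vector v
        sigma = np.linalg.norm(u)  # current estimate, a lower bound
        if sigma == 0.0:
            break
        v = u / sigma
    return sigma

rng = np.random.default_rng(1)
W = rng.standard_normal((27, 16))

estimate = gram_deviation_spectral_norm(W)
# Reference value from a dense computation (largest singular value).
exact = np.linalg.norm(W.T @ W - np.eye(16), 2)
```

Each iteration costs two matrix-vector products, which is small next to the convolution itself. Fewer iterations give a cheaper but looser estimate.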
5 Conclusion
We presented an efficient mechanism for regularizing different flavors of orthogonality, and evaluated it on several state-of-the-art deep CNNs zagoruyko2016wide (); he2016deep (); xie2017aggregated (). We showed that in all cases we can achieve better accuracy, more stable training curves, and smoother convergence. In almost all cases, the novel SRIP regularizer remarkably outperforms all the others. These regularizations demonstrate outstanding generality and ease of use, suggesting that orthogonality regularizations should be considered standard tools for training deeper CNNs. As future work, we are interested in extending the evaluation of SRIP to training RNNs and GANs. To summarize our results, a befitting quote would be: enforce orthogonality when training your CNN, and by no means will you regret it!
Footnotes
 https://github.com/nbansal90/CanweGainMorefromOrthogonality
 We also tried first normalizing the columns of $W$ and then applying (4), without finding any performance benefits.
References
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
 Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456, 2015.
 Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
 Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Jianping Zhou, Minh N Do, and Jelena Kovacevic. Special paraunitary matrices, cayley transform, and multidimensional orthogonal filter banks. IEEE Transactions on Image Processing, 15(2):511–519, 2006.
 Pau Rodríguez, Jordi Gonzalez, Guillem Cucurull, Josep M Gonfaus, and Xavier Roca. Regularizing cnns with locally constrained decorrelations. arXiv preprint arXiv:1611.01967, 2016.
 Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.
 Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
 Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of deep neural networks via singular value bounding. CoRR, abs/1611.06013, 2016.
 Mehrtash Harandi and Basura Fernando. Generalized backpropagation, Étude de cas: Orthogonality. arXiv preprint arXiv:1611.05927, 2016.
 Mete Ozay and Takayuki Okatani. Optimization on submanifolds of convolution kernels in cnns. arXiv preprint arXiv:1610.07008, 2016.
 Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. arXiv preprint arXiv:1703.01827, 2017.
 Lei Huang, Xianglong Liu, Bo Lang, Adams Wei Yu, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079, 2017.
 Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. Svdnet for pedestrian retrieval. arXiv preprint, 2017.
 Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
 David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
 Zhaowen Wang, Jianchao Yang, Haichao Zhang, Zhangyang Wang, Yingzhen Yang, Ding Liu, and Thomas S Huang. Sparse Coding and its Applications in Computer Vision. World Scientific, 2016.
 Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Tong Zhang. Sparse recovery with orthogonal matching pursuit under rip. IEEE Transactions on Information Theory, 57(9):6215–6221, 2011.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
 Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
 Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 Victor Dorobantu, Per Andre Stromhaug, and Jess Renteria. Dizzyrnn: Reparameterizing recurrent neural networks for norm-preserving backpropagation. arXiv preprint arXiv:1612.04035, 2016.
 Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
 Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using householder reflections. arXiv preprint arXiv:1612.00188, 2016.
 Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. arXiv preprint arXiv:1702.00071, 2017.
 Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.
 Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Randall Balestriero and Richard Baraniuk. A spline theory of deep networks. Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
 Randall Balestriero and Richard Baraniuk. Mad max: Affine spline insights into deep learning. arXiv preprint arXiv:1805.06576, 2018.
 Shengjie Wang, Abdelrahman Mohamed, Rich Caruana, Jeff Bilmes, Matthai Philipose, Matthew Richardson, Krzysztof Geras, Gregor Urban, and Ozlem Aslan. Analysis of deep neural networks with extended data jacobian matrix. In International Conference on Machine Learning, pages 718–726, 2016.
 Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.
 Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 Zhangyang Wang, Hongyu Xu, Haichuan Yang, Ding Liu, and Ji Liu. Learning simple thresholded features with sparse support recovery. arXiv preprint arXiv:1804.05515, 2018.
 Zhouchen Lin, Canyi Lu, and Huan Li. Optimized projections for compressed sensing via direct mutual coherence minimization. arXiv preprint arXiv:1508.03117, 2015.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
 Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.