# Stable Rank Normalization for Improved Generalization in Neural Networks and GANs

Amartya Sanyal (amartya.sanyal@cs.ox.ac.uk), Philip H.S. Torr (Department of Engineering Science, University of Oxford; phst@robots.ox.ac.uk), Puneet K. Dokania (Department of Engineering Science, University of Oxford; puneet@robots.ox.ac.uk)

###### Abstract

Exciting new work on generalization bounds for neural networks (NN) by Neyshabur et al. (2018) and Bartlett et al. (2017) shows that these bounds closely depend on two parameter-dependent quantities: the Lipschitz constant upper-bound and the stable rank (a softer version of the rank operator). This leads to an interesting question of whether controlling these quantities might improve the generalization behaviour of NNs. To this end, we propose stable rank normalization (SRN), a novel, optimal, and computationally efficient weight-normalization scheme which minimizes the stable rank of a linear operator. Surprisingly, we find that SRN, in spite of being a non-convex problem, can be shown to have a unique optimal solution. Moreover, we show that SRN allows control of the data-dependent empirical Lipschitz constant, which, in contrast to the Lipschitz upper-bound, reflects the true behaviour of a model on a given dataset. We provide thorough analyses to show that SRN, when applied to the linear layers of a NN for classification, provides striking improvements on the generalization gap compared to the standard NN, along with a significant reduction in memorization. When applied to the discriminator of GANs (called SRN-GAN), it improves Inception, FID, and Neural divergence scores on the CIFAR 10/100 and CelebA datasets, while learning mappings with low empirical Lipschitz constants.

## 1 Introduction

Deep neural networks have shown an astonishing ability to tackle a wide variety of problems and to generalize well. In this work, we leverage recent and important theoretical results on the generalization bounds of deep networks to yield a practical, low-cost method to normalize the weights within a network using a scheme which we call Stable Rank Normalization (SRN). The motivation behind SRN comes from the generalization bound of NNs given by Neyshabur et al. (2018) and Bartlett et al. (2017), $\mathcal{O}\Big(\sqrt{\prod_{j=1}^{d}\|W_j\|_2^2\,\sum_{i=1}^{d}\mathrm{srank}(W_i)}\Big)$ (here $d$ and $\|W_i\|_2$ represent the number of layers and the spectral norm of the $i$-th linear layer $W_i$, respectively; only terms depending on the model parameters are shown, and log terms are ignored), that depends on two parameter-dependent quantities: the scale-dependent Lipschitz constant upper-bound $\prod_{i=1}^{d}\|W_i\|_2$ and the scale-independent stable rank ($\mathrm{srank}(W_i)$, refer Definition 3.1), a softer version of the rank operator. The empirical impact of directly controlling these quantities on the generalization behaviour of NNs has not been explored yet. In this work, we consider both these quantities and, based on extensive experiments, we show that controlling them indeed remarkably improves the generalization (and memorization) behaviour of NNs and the training of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014). Note, our results are even more significant in the context of the seminal work by Zhang et al. (2018b), where one of the observations was that regularizers like weight decay and dropout have little impact on the generalization of NNs.

Recently, significant attention has been given to learning low Lipschitz functions, showing that, along with providing better generalization (Anthony and Bartlett, 2009; Bartlett et al., 2017; Neyshabur et al., 2018, 2015; Yoshida and Miyato, 2017; Gouk et al., 2018), they also help in the stable training of GANs (Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018) and in robustness against adversarial attacks (Cisse et al., 2017). However, even though learning low Lipschitz functions is desirable, bounding the Lipschitz constant alone is not sufficient to provide a realistic guarantee on the generalization error. Arora et al. (2018) also suggested that the worst-case Lipschitz constant often provides vacuous generalization bounds. An easy example is that scaling an entire ReLU network by a constant will not alter the classification behaviour (and thus the generalization), but can massively increase the Lipschitz constant. These arguments clearly suggest that, along with the Lipschitz constant (the first parameter-based quantity in the generalization bound), regularizing the stable rank (the second quantity) can be extremely useful for improved generalization of NNs.
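The scaling example above can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-layer bias-free ReLU network in which "scaling the network" means multiplying every weight matrix by the same constant; the names `relu_net` and `lipschitz_upper_bound` are illustrative and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))   # first (bias-free) linear layer
W2 = rng.standard_normal((3, 16))   # second linear layer producing 3 class scores

def relu_net(x, scale=1.0):
    """Two-layer bias-free ReLU network whose weights are all multiplied by `scale`."""
    return (scale * W2) @ np.maximum((scale * W1) @ x, 0.0)

def lipschitz_upper_bound(scale=1.0):
    """Product of the layer-wise spectral norms (2-norm Lipschitz upper-bound)."""
    return np.linalg.norm(scale * W2, 2) * np.linalg.norm(scale * W1, 2)

X = rng.standard_normal((8, 100))                   # 100 random inputs stored as columns
pred_base = relu_net(X).argmax(axis=0)              # predicted classes, original weights
pred_scaled = relu_net(X, scale=10.0).argmax(axis=0)  # predicted classes, scaled weights

bound_base = lipschitz_upper_bound(1.0)
bound_scaled = lipschitz_upper_bound(10.0)          # grows by scale**2 for two layers
```

Because ReLU is positively homogeneous, the scaled network outputs a positive multiple of the original scores, so the predicted classes coincide while the upper-bound grows with the scale.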

To this end, we propose SRN, which explicitly allows us to control the stable rank of each linear layer of any NN. Precisely, we formulate a novel and generic objective function (Eq. 4) that, along with normalizing the stable rank of a given matrix, also allows preservation of a part of the spectrum of the matrix. For example, one might want to preserve the top $k$ singular values of the given matrix while modifying it such that it has the desired stable rank and is closest to the original matrix in terms of Frobenius norm. When $k=0$ (no singular value preservation constraint), the objective function turns out to be non-convex; otherwise, it is convex. We would like to emphasize that we provide optimal unique solutions to SRN (problem Eq. 4), for both the non-convex and convex cases, with theoretical guarantees and extensive proofs (Section 3.1). In terms of algorithmic similarity, SRN is similar to Spectral Normalization (SN) (Miyato et al., 2018) in the sense that it scales singular values; however, the scaling provides a new mapping with the desired stable rank. Computationally (Section 3.1), it only requires computing the first singular value (when $k=1$), which can be efficiently obtained using the power iteration method (Mises and Pollaczek-Geiringer, 1929).

Furthermore, we argue that the said upper-bound on the Lipschitz constant (the first quantity in the generalization bound), along with being scale-dependent, is also data-independent and hence is a very pessimistic estimate of the true behaviour of the given network on a particular task or dataset. Thus, instead of the Lipschitz upper-bound, we look at the data-dependent empirical estimate of the Lipschitz constant, $L_e$ (refer Section 2). Using a simple two-layer linear NN (refer Section 3), we first show that reducing the rank of individual linear layers can reduce $L_e$ without changing the spectral norms (and hence the Lipschitz upper-bound). Motivated by this, we experimentally analyse the effect of SRN on $L_e$ and show that it indeed allows us to learn mappings with low empirical Lipschitz constants. Thus, SRN, along with controlling the stable rank (the second quantity), also controls (indirectly) the empirical estimate of the Lipschitz constant, the first quantity in the generalization bound.

The improved generalization effect of SRN can further be explained by minimum description length (MDL) based arguments, which suggest that solutions with low MDL are generally flat in nature and are more generalizable compared to their high-MDL counterparts (sharp minima) (Hochreiter and Schmidhuber, 1997). Thus, an optimum obtained using low (stable) rank weights, which requires fewer bits to describe, must be relatively flat in nature, and hence more generalizable.

Even though SRN is applicable to any problem involving a sequence of affine transformations, we show our experiments on deep neural networks. Specifically, on classification (CIFAR10/100), a NN trained using SRN, while maintaining the accuracy, strikingly improves generalization and significantly reduces memorization. Additionally, on GANs, it learns discriminators with low empirical Lipschitz constants while providing improved Inception, FID and Neural divergence scores (Gulrajani et al., 2019).

## 2 Background and Intuitions

#### Neural Networks

Consider $f_\theta : \mathbb{R}^m \to \mathbb{R}^K$ to be a feed-forward multilayer NN parameterized by $\theta$, each layer of which consists of a linear mapping followed by a non-linear one (e.g. ReLU, tanh, sigmoid, and maxout). Let $a_{l-1}$ be the input (or pre-activations) to the $l$-th layer; then the output (or activations) of this layer is represented as $a_l = \phi_l(z_l)$, where $z_l = W_l a_{l-1} + b_l$ is the output of the linear (affine) layer parameterized by the weights $W_l$ and biases $b_l$, and $\phi_l(\cdot)$ is the element-wise non-linear function applied to $z_l$. For classification tasks, given a dataset with input-output pairs denoted as $(x \in \mathbb{R}^m,\ y \in \{0,1\}^K)$ with $\sum_j y_j = 1$ (here $y_j$ is the $j$-th element of the vector $y$; only one class is assigned as the ground-truth label to each $x$), the parameter vector $\theta$ is learned using back-propagation to optimize the classification loss (e.g., cross-entropy).

#### Lipschitz Constant

Here we describe the global, the local, and the empirical (data-dependent) Lipschitz constants. Briefly, the Lipschitz constant quantifies the sensitivity of the output with respect to a change in the input. A function $f : \mathbb{R}^m \to \mathbb{R}^K$ is globally $L$-Lipschitz continuous if there exists $L \in \mathbb{R}_+$ such that $\|f(x_i) - f(x_j)\|_q \le L\,\|x_i - x_j\|_p$ for all $x_i, x_j \in \mathbb{R}^m$, where $\|\cdot\|_p$ and $\|\cdot\|_q$ represent the norms in the input and the output metric spaces, respectively. The global Lipschitz constant is:

$$L_g = \max_{x_i, x_j \in \mathbb{R}^m} \frac{\|f(x_i) - f(x_j)\|_q}{\|x_i - x_j\|_p}. \tag{1}$$

The above definition of the Lipschitz constant depends on all pairs of inputs (thus, global). However, one can define the local Lipschitz constant based on the sensitivity of $f$ in the vicinity of a given point $x$. Precisely, at $x$, the local Lipschitz constant is computed on the open ball of radius $\delta$ (which can be arbitrarily small) centered at $x$. Let $h \in \mathbb{R}^m$ with $\|h\|_p < \delta$; then, similar to $L_g$, the local Lipschitz constant of $f$ at $x$, $L_l(x)$, is greater than or equal to $\frac{\|f(x+h) - f(x)\|_q}{\|h\|_p}$. Assuming $f$ to be Fréchet differentiable, as $h \to 0$, using $f(x+h) - f(x) \approx J_f(x)\,h$, $L_l(x)$ can be upper bounded using $\|J_f(x)\|_{p,q}$. A function is said to be locally Lipschitz with local Lipschitz constant $L_l$ if for all $x \in \mathbb{R}^m$ there exists a local Lipschitz constant $L_l(x)$ at $x$. Here, $J_f(x)$ is the Jacobian and $\|\cdot\|_{p,q}$ denotes the matrix (operator) norm. Thus,

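$$L_l(x) = \max_{\substack{h \neq 0 \\ \|h\|_p < \delta}} \frac{\|J_f(x)\,h\|_q}{\|h\|_p} = \max_{\substack{h \neq 0 \\ h \in \mathbb{R}^m}} \frac{\|J_f(x)\,h\|_q}{\|h\|_p} = \|J_f(x)\|_{p,q}. \tag{2}$$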

and $L_l = \max_{x} L_l(x)$. Notice that the Lipschitz constant (global or local) greatly depends on the chosen norms. When $p = q = 2$, the upper-bound on the local Lipschitz constant at $x$ boils down to the 2-matrix norm (maximum singular value) of the Jacobian $J_f(x)$.

#### Empirical Lipschitz

In practice, the behaviour of a model is captured and evaluated using the training and the test data. Neither during training nor during testing does the model have access to the entire domain, and thus its behaviour on the domain outside the data distribution is of little significance. We thus compute an empirical estimate of $L_l$ and $L_g$ over a task-specific dataset $\mathcal{D}$, which we call the local and global $L_e$, respectively. Depending on the task, $\mathcal{D}$ can either be the training/test data, the generated data (e.g., in generative models), or some interpolated data. Additionally, Proposition B.1 shows the relationship between the global and the local $L_e$, and Novak et al. (2018) provided empirical results showing how the local $L_e$ (in the vicinity of train data) is correlated with the generalization of NNs. This further supports using the data-dependent $L_e$ to better understand the generalization behaviour.
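As an illustration of the global empirical estimate, the ratio in Eq. 1 can be evaluated over all pairs drawn from a finite dataset rather than over the whole domain. This is a minimal sketch using the 2-norm; the model `f`, the matrix `A`, and the random `data` are hypothetical stand-ins, not the networks or datasets used in the paper.

```python
import numpy as np
from itertools import combinations

def global_empirical_lipschitz(f, data):
    """Largest ratio ||f(xi) - f(xj)||_2 / ||xi - xj||_2 over all distinct pairs in `data`."""
    best = 0.0
    for xi, xj in combinations(data, 2):
        gap = np.linalg.norm(xi - xj)
        if gap > 0:  # skip duplicated points, for which the ratio is undefined
            best = max(best, np.linalg.norm(f(xi) - f(xj)) / gap)
    return best

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))

def f(x):
    """Hypothetical model used only for illustration: one linear layer followed by tanh."""
    return np.tanh(A @ x)

data = rng.standard_normal((50, 6))   # stand-in for the task-specific dataset
L_e = global_empirical_lipschitz(f, data)

# tanh is 1-Lipschitz element-wise, so ||A||_2 upper-bounds the global Lipschitz constant of f.
L_upper = np.linalg.norm(A, 2)
```

The empirical value can never exceed the data-independent bound, and on typical data it is noticeably smaller.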

#### The local Lipschitz upper-bound for Neural Networks

As mentioned earlier, $L_l(x) = \|J_f(x)\|_{p,q}$, where, in the case of NNs (proof, along with why the bound is loose, in Appendix C),

$$L_l(x) = \|J_f(x)\|_{p,q} \le \|W_l\|_{p,q} \cdots \|W_1\|_{p,q} \quad \text{and} \quad L_l = \max_{x} L_l(x). \tag{3}$$

Note, the above upper bound is independent of the data and the task, suggesting that it must be very loose compared to the data-dependent $L_e$. Even though this observation makes the bound less reliable, it is widely used as the motivation behind various regularizers that act on the operator norm (generally, the 2-matrix norm) of the linear layers of a NN to control the Lipschitz constant.
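The gap between the data-dependent local estimate and the bound in Eq. 3 can be seen on a small example. The sketch below assumes a hypothetical bias-free two-layer ReLU network with random weights and uses the 2-norm throughout; for such a network the Jacobian at $x$ is the product of the weights with the ReLU activation pattern in between.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((32, 10))
W2 = rng.standard_normal((5, 32))

def jacobian(x):
    """Jacobian of f(x) = W2 @ relu(W1 @ x) at x (valid where no pre-activation is exactly 0)."""
    active = (W1 @ x > 0).astype(float)    # derivative of ReLU at each hidden unit
    return W2 @ (active[:, None] * W1)     # W2 @ diag(active) @ W1

data = rng.standard_normal((200, 10))      # stand-in for the dataset

# Local empirical Lipschitz constant: largest Jacobian spectral norm over the data.
local_L_e = max(np.linalg.norm(jacobian(x), 2) for x in data)

# Data-independent upper-bound: product of the layer-wise spectral norms.
upper_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)
```

The bound holds for every input because the activation pattern can only zero out rows of `W1`, which never increases the spectral norm.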

## 3 Stable Rank Normalization

We begin with the definition and interesting properties of stable rank in Definition 3.1. As mentioned in Section 1, generalization bounds of NNs directly depend on the local Lipschitz upper-bound and the sum of the stable ranks of the linear layers. We control both these quantities. Specifically, we propose SRN, a novel and optimal weight normalization scheme to minimize the stable rank of linear mappings. As argued, SRN, along with directly impacting the generalization bound, also minimizes $L_e$, which can further help in improving the generalization of NNs (recall the MDL (Hochreiter and Schmidhuber, 1997) and Jacobian norm (Novak et al., 2018) based arguments provided in Sections 1 and 2). To further strengthen this argument, we first consider an example to show that learning low (stable) rank mappings can greatly reduce the data-dependent $L_e$, and then propose our algorithm and show how it can be applied to any linear mapping in NNs.

###### Definition 3.1.

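The Stable Rank (Rudelson and Vershynin, 2007) of an arbitrary matrix $W$ is defined as $\mathrm{srank}(W) = \frac{\|W\|_F^2}{\|W\|_2^2} = \frac{\sum_{i=1}^{p} \sigma_i^2(W)}{\sigma_1^2(W)}$, where $p$ is the rank of the matrix and $\sigma_i(W)$ is its $i$-th largest singular value. Stable rank is

• a soft version of the rank operator and, unlike rank, is less sensitive to small perturbations.

• differentiable as both Frobenius and Spectral norms are almost always differentiable.

• upper-bounded by the rank: $\mathrm{srank}(W) = \frac{\sum_{i=1}^{p} \sigma_i^2(W)}{\sigma_1^2(W)} \le p$.

• invariant to scaling, implying $\mathrm{srank}(W) = \mathrm{srank}(W/\eta)$ for any $\eta \neq 0$.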
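The first and third properties can be illustrated with a short sketch. The matrices below are hypothetical random examples: a rank-3 matrix and a copy of it with a tiny additive perturbation, which changes the rank drastically but leaves the stable rank almost untouched.

```python
import numpy as np

def stable_rank(W):
    """srank(W) = ||W||_F^2 / ||W||_2^2 (squared Frobenius norm over squared spectral norm)."""
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 30))   # rank-3 matrix
noisy = low_rank + 1e-6 * rng.standard_normal((20, 30))                  # tiny perturbation

rank_clean = np.linalg.matrix_rank(low_rank)   # numerical rank of the clean matrix
rank_noisy = np.linalg.matrix_rank(noisy)      # numerical rank after the perturbation
srank_clean = stable_rank(low_rank)
srank_noisy = stable_rank(noisy)
```

Here `rank_noisy` jumps to the full rank of the matrix, while `srank_noisy` stays within a small tolerance of `srank_clean`, which itself is at most 3.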

#### Effect of Rank on Empirical Lipschitz Constants

Let $f(x) = W_2 W_1 x$ be a two-layer linear NN with weights $W_1$ and $W_2$. The Jacobian in this case is independent of $x$. Thus, the local Lipschitz constant is the same for all $x$, implying local $L_e = L_l = \|W_2 W_1\|_{p,q} \le \|W_2\|_{p,q}\,\|W_1\|_{p,q}$. Note, in the case of the 2-matrix norm, reducing the rank will not affect the upper-bound. However, as will be discussed below, rank reduction greatly influences the global $L_e$.

Let $x_i$ and $x_j$ be random pairs from $\mathcal{D}$ and $\Delta x = x_i - x_j$ be the difference; then the global $L_e$ is $\max_{x_i, x_j \in \mathcal{D}} \frac{\|W_2 W_1 \Delta x\|}{\|\Delta x\|}$. Let $k_1$ and $k_2$ be the ranks, and $\sigma_1 \ge \cdots \ge \sigma_{k_1}$ and $s_1 \ge \cdots \ge s_{k_2}$ the singular values of the matrices $W_1$ and $W_2$, respectively. Let $P_j = u_j v_j^\top$ be the orthogonal projection matrix corresponding to $u_j$ and $v_j$, the left and the right singular vectors of $W_1$. Similarly, we define $\bar{P}_i = \bar{u}_i \bar{v}_i^\top$ for $W_2$, corresponding to $\bar{u}_i$ and $\bar{v}_i$. Then, $W_2 W_1 \Delta x = \sum_{i=1}^{k_2} \sum_{j=1}^{k_1} s_i \sigma_j \bar{P}_i P_j \Delta x$. The upper-bound, $s_1 \sigma_1 \|\Delta x\|$, can only be achieved if $\Delta x$ is aligned with $v_1$ and $u_1$ is aligned with $\bar{v}_1$ (a perfect alignment), which is highly unlikely. In practice, not just the maximum singular values, as is the case with the Lipschitz upper-bound, but rather the combination of the projection matrices and the singular values plays a crucial role in providing an estimate of the global $L_e$. Thus, reducing the singular values, which is equivalent to minimizing the rank (or stable rank), will directly affect $L_e$. For example, assigning $\sigma_j = 0$, which in effect will reduce the rank of $W_1$ by one, will nullify its influence on all projections associated with $P_j$. This implies that all the projections $\bar{P}_i P_j$ that would propagate the input via $P_j$ will be blocked. This, in effect, will influence $\|W_2 W_1 \Delta x\|$, and hence the global $L_e$. In a more general setting, let $k_i$ be the rank of the $i$-th linear layer; then each singular value of the $j$-th layer can influence a maximum of $\prod_{i \neq j} k_i$ paths through which an input can be propagated. Thus, mappings with low (stable) rank will greatly reduce the global $L_e$. Similar arguments can be drawn for the local $L_e$ in the case of a NN with non-linearity.
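The two-layer argument can be explored numerically. The sketch below assumes hypothetical random weights and random data, uses the 2-norm, and reduces the rank of the first layer by zeroing its smallest singular values (the helper `truncate_rank` is illustrative, not a procedure from the paper). The spectral-norm upper-bound is unchanged by construction, whereas the empirical constant is free to change.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((20, 20))
W2 = rng.standard_normal((20, 20))

def truncate_rank(W, keep):
    """Zero all but the `keep` largest singular values; the spectral norm is unchanged."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s[keep:] = 0.0
    return (U * s) @ Vt

def global_L_e(W2, W1, deltas):
    """Maximum over the difference vectors (columns) of ||W2 W1 dx||_2 / ||dx||_2."""
    out = W2 @ W1 @ deltas
    return np.max(np.linalg.norm(out, axis=0) / np.linalg.norm(deltas, axis=0))

X = rng.standard_normal((20, 200))
deltas = X[:, :100] - X[:, 100:]           # differences x_i - x_j of 100 random pairs

W1_low = truncate_rank(W1, keep=2)         # rank-2 version of the first layer
bound_full = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)
bound_low = np.linalg.norm(W2, 2) * np.linalg.norm(W1_low, 2)
L_full = global_L_e(W2, W1, deltas)
L_low = global_L_e(W2, W1_low, deltas)
```

Comparing `L_low` with `L_full` on such data shows how much of the empirical constant was carried by the removed singular directions, while `bound_low` and `bound_full` coincide.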

### 3.1 Optimal Solution to the Stable Rank Normalization Problem

Since stable rank is invariant to scaling (refer Definition 3.1), any normalization scheme that modifies $W = \sum_i \sigma_i u_i v_i^\top$ to $\hat{W} = \sum_i \frac{\sigma_i}{\eta} u_i v_i^\top$ will have no effect on the stable rank. Examples of such schemes are SN (Miyato et al., 2018), where $\eta = \sigma_1$, and Frobenius normalization, where $\eta = \|W\|_F$. This makes stable rank normalization non-trivial. As will be shown, our approach to stable rank normalization is efficient and, as opposed to the widely used SN (Miyato et al., 2018), optimal (optimal spectral normalization requires computing all $\sigma_i$; details with optimality proof in Section A.3).
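The scale invariance is easy to confirm numerically. A minimal sketch on a hypothetical random matrix, with exact (SVD-based) norms in place of the power-iteration estimate used in practice:

```python
import numpy as np

def stable_rank(W):
    """srank(W) = ||W||_F^2 / ||W||_2^2."""
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))

W_sn = W / np.linalg.norm(W, 2)        # spectral normalization: eta = sigma_1
W_fn = W / np.linalg.norm(W, "fro")    # Frobenius normalization: eta = ||W||_F

sranks = [stable_rank(M) for M in (W, W_sn, W_fn)]   # all three values agree
```

Both schemes rescale every singular value by the same factor, so neither can be used to reach a prescribed stable rank.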

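We now define our new and generic objective function for Stable Rank Normalization (SRN), and then present its optimal unique solutions, for both the convex and non-convex cases. Given a matrix $W \in \mathbb{R}^{m \times n}$ with rank $p$ and spectral partitioning index $k$ ($0 \le k < p$), we formulate the SRN problem as:

$$\operatorname*{arg\,min}_{\hat{W}_k \in \mathbb{R}^{m \times n}} \|W - \hat{W}_k\|_F^2 \quad \text{s.t.} \quad \underbrace{\mathrm{srank}(\hat{W}_k) = r}_{\text{stable rank constraint}}, \quad \underbrace{\lambda_i = \sigma_i, \ \forall i \in \{1, \dots, k\}}_{\text{spectrum preservation constraints}} \tag{4}$$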

where $r \in [1, \mathrm{srank}(W)]$ is the desired stable rank, and the $\lambda_i$'s and $\sigma_i$'s are the singular values of $\hat{W}_k$ and $W$, respectively. The partitioning index $k$ is used for the singular value (or spectrum) preservation constraint. It gives us the flexibility to obtain $\hat{W}_k$ such that its top $k$ singular values are exactly the same as those of the original matrix. We provide the optimal unique solution to the stable rank problem (Eq. 4) in Theorem 1, with extensive proofs and various insights in Section A.1. Note, at $k = 0$, the problem (Eq. 4) is non-convex, otherwise convex.

###### Theorem 1.

Given a real matrix $W \in \mathbb{R}^{m \times n}$ with rank $p$, a target spectrum (or singular value) preservation index $k$, and a target stable rank $r$, the optimal solution $\hat{W}_k$ of problem (Eq. 4) is $\hat{W}_k = \gamma_1 S_1 + \gamma_2 S_2$, where $S_1 = \sum_{i=1}^{\max(1,k)} \sigma_i u_i v_i^\top$ and $S_2 = W - S_1$. Here $\{\sigma_i\}$, $\{u_i\}$ and $\{v_i\}$ are the top singular values and vectors of $W$, and, depending on $k$, $\gamma_1$ and $\gamma_2$ are defined below. For simplicity, we first define $\gamma = \frac{\sqrt{r \sigma_1^2 - \|S_1\|_F^2}}{\|S_2\|_F}$, then

• If $k = 0$ (no spectrum preservation), the problem becomes non-convex, the optimal solution to which is obtained for $\gamma_2 = \frac{\gamma + r - 1}{r}$ and $\gamma_1 = \frac{\gamma_2}{\gamma}$. Since $\gamma \le 1$, $\gamma_1 \ge 1$.

• If $k \ge 1$, the problem is convex and the optimal solution is obtained for $\gamma_1 = 1$ and $\gamma_2 = \gamma$.

• Also, $\|\hat{W}_k - W\|_F$ is monotonically increasing with $k$ for $k \ge 1$.

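Theorem 1 provides various ways of obtaining a matrix with the desired stable rank depending on the constraints. Intuitively, it partitions the given matrix into two parts, depending on $k$, and then scales them differently in order to obtain the optimal solution. The value of the partitioning index $k$ is a design choice. If there is no particular preference for $k$, then $k = 0$ gives the solution closest to the original matrix. In addition, the proof of Theorem 1 in Section A.1 also shows that, for a particular choice of $k$, the optimal solution requires only a partial SVD to obtain the top $k$ singular values and vectors. It is easy to verify that as $k$ increases, $\gamma$ decreases, and thus the scaling required for the second partition $S_2$ is much more aggressive. Refer to Section A.2 for an example. Note, for $k \ge 1$, the optimal solution has the same spectral norm as that of the original matrix (as $\gamma_1 = 1$), and it only requires scaling of $S_2$ using $\gamma_2$, where $\gamma_2 = \gamma \le 1$. However, for $k = 0$, notably, the optimal solution's spectral norm is higher than that of the given matrix (as $\gamma_1 \ge 1$).

**Algorithm 1** Stable Rank Normalization

Require: $W \in \mathbb{R}^{m \times n}$, $r$, $k \ge 1$

1. $S_1 \leftarrow 0$, $\|S_1\|_F^2 \leftarrow 0$, $\eta \leftarrow \|W\|_F^2$, $l \leftarrow 0$
2. for $i \in \{1, \dots, k\}$ do
3. $\{u_i, v_i, \sigma_i\} \leftarrow \mathrm{SVD}(W, i)$ (power method to get the $i$-th singular value)
4. if $r \ge \frac{\|S_1\|_F^2 + \sigma_i^2}{\sigma_1^2}$ then
5. $S_1 \leftarrow S_1 + \sigma_i u_i v_i^\top$
6. $\|S_1\|_F^2 \leftarrow \|S_1\|_F^2 + \sigma_i^2$
7. $l \leftarrow l + 1$
8. else break
9. end if
10. end for
11. $\gamma \leftarrow \frac{\sqrt{r \sigma_1^2 - \|S_1\|_F^2}}{\sqrt{\eta - \|S_1\|_F^2}}$
12. return $\hat{W}_l \leftarrow S_1 + \gamma\,(W - S_1)$, $l$

**Algorithm 2** SRN for a Linear Layer in NN

Require: $W \in \mathbb{R}^{m \times n}$, $r$, learning rate $\alpha$, mini-batch dataset $\mathcal{D}$

1. Initialize $u \in \mathbb{R}^m$ with a random vector.
2. $v \leftarrow \frac{W^\top u}{\|W^\top u\|}$, $u \leftarrow \frac{W v}{\|W v\|}$ (perform power iteration)
3. $\sigma(W) \leftarrow u^\top W v$
4. $W_f \leftarrow W / \sigma(W)$ (Spectral Normalization)
5. $\hat{W} \leftarrow W_f - u v^\top$
6. if $\|\hat{W}\|_F \le \sqrt{r - 1}$ then
7. return $W_f$
8. end if
9. $W_f \leftarrow u v^\top + \hat{W}\,\frac{\sqrt{r - 1}}{\|\hat{W}\|_F}$ (Stable Rank Normalization)
10. return $W_f$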
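The closed-form solution stated in the theorem can be sketched directly. The code below is an illustration under stated assumptions rather than the authors' implementation: it uses a full SVD instead of a partial one, writes the scaling factors as $\gamma = \sqrt{r\sigma_1^2 - \|S_1\|_F^2}/\|S_2\|_F$, $\gamma_2 = (\gamma + r - 1)/r$, $\gamma_1 = \gamma_2/\gamma$ for $k=0$ and $\gamma_1 = 1$, $\gamma_2 = \gamma$ for $k \ge 1$, and assumes $1 \le r \le \mathrm{srank}(W)$ with a feasible $k$. The function name `srn` and the test matrix are hypothetical.

```python
import numpy as np

def stable_rank(W):
    """srank(W) = ||W||_F^2 / ||W||_2^2."""
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2

def srn(W, r, k=0):
    """Return gamma1 * S1 + gamma2 * S2 with stable rank r (assumes 1 <= r <= srank(W)).

    For k >= 1 the top-k singular values are preserved, which additionally
    assumes r * sigma_1^2 >= sigma_1^2 + ... + sigma_k^2.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    top = max(1, k)
    S1 = (U[:, :top] * s[:top]) @ Vt[:top]      # leading partition
    S2 = W - S1                                 # remaining partition
    gamma = np.sqrt(r * s[0] ** 2 - np.sum(s[:top] ** 2)) / np.linalg.norm(S2, "fro")
    if k == 0:
        gamma2 = (gamma + r - 1.0) / r
        gamma1 = gamma2 / gamma if gamma > 0 else 1.0   # r == 1 keeps only S1
    else:
        gamma1, gamma2 = 1.0, gamma
    return gamma1 * S1 + gamma2 * S2

rng = np.random.default_rng(0)
W = rng.standard_normal((30, 40))
r = 3.0
W0, W1, W2 = (srn(W, r, k) for k in (0, 1, 2))
errors = [np.linalg.norm(W - M, "fro") for M in (W0, W1, W2)]   # distance to W for k = 0, 1, 2
```

All three outputs reach the target stable rank; the $k=0$ solution is the closest to `W` but has a larger spectral norm, while the $k \ge 1$ solutions keep the spectral norm and move further away as $k$ grows.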

#### Algorithm for Stable Rank Normalization

We provide a general procedure in Algorithm 1 to solve the stable rank normalization problem for $k \ge 1$ (the solution for $k = 0$ is straightforward from Theorem 1). Claim 2 provides the properties of the algorithm. The algorithm is constructed so that prior knowledge of the rank of the matrix is not necessary.

###### Claim 2.

Given a matrix $W$, the desired stable rank $r$, and the partitioning index $k \ge 1$, Algorithm 1 returns $\hat{W}_l$ and a scalar $l$ such that $\mathrm{srank}(\hat{W}_l) = r$, and the top $l$ singular values of $W$ and $\hat{W}_l$ are the same. If $l = k$, then the solution provided is the optimal solution to the problem (4) with all the constraints satisfied; otherwise, it returns the largest $l$ up to which the spectrum is preserved. The proof follows directly from the proof of Theorem 1.

### 3.2 Combining Stable Rank and Spectral Normalization for NNs

As discussed in Section 1, controlling both the layer-wise spectral norm and the stable rank plays a crucial role in the generalization of NNs. In addition, as discussed earlier, even though normalizing the spectral norm guarantees control of the upper-bound on the Lipschitz constant, it does not say much about the empirical Lipschitz constant ($L_e$). However, normalizing the stable rank reduces $L_e$ as well, which is a more expressive representation of the behaviour of a model over a given dataset. Motivated by these arguments, we normalize both the stable rank and the spectral norm of each linear layer of a NN simultaneously. To do so, we first perform approximate SN (Miyato et al., 2018), and then perform optimal SRN (using Algorithm 1) with $k = 1$. This ensures that the first singular value (which is now normalized) is preserved. Algorithm 2 provides a simplified procedure for the same for a given linear layer of a NN. Note, the computational cost of this algorithm is exactly the same as that of SN, namely computing the top singular value using the power iteration method.
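The combined procedure can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: it performs power iteration, divides by the estimated top singular value, and then rescales the remainder so that the result has stable rank $r$ (the $k=1$ case with unit spectral norm, where the scaling factor is $\sqrt{r-1}/\|S_2\|_F$). The function name `srn_layer`, the `n_power_iter` argument, and the synthetic test matrix are hypothetical, and the training-related inputs (learning rate, mini-batch) are omitted.

```python
import numpy as np

def stable_rank(W):
    """srank(W) = ||W||_F^2 / ||W||_2^2."""
    return np.linalg.norm(W, "fro") ** 2 / np.linalg.norm(W, 2) ** 2

def srn_layer(W, r, u, n_power_iter=1):
    """Approximate SN followed by SRN with k = 1; returns the new weight and updated `u`."""
    for _ in range(n_power_iter):          # power iteration for the top singular pair
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                      # estimate of the top singular value
    W_f = W / sigma                        # spectral normalization
    S2 = W_f - np.outer(u, v)              # remainder after removing the top component
    s2_norm = np.linalg.norm(S2, "fro")
    if s2_norm <= np.sqrt(r - 1.0):        # stable rank is already at most r
        return W_f, u
    return np.outer(u, v) + S2 * (np.sqrt(r - 1.0) / s2_norm), u

# Synthetic weight with a known spectrum (4.0, 3.8, ..., 0.2) so power iteration converges quickly.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((20, 20)))
V, _ = np.linalg.qr(rng.standard_normal((30, 20)))
W = (U * np.linspace(4.0, 0.2, 20)) @ V.T
u0 = rng.standard_normal(20)

W_hat, u = srn_layer(W, r=3.0, u=u0, n_power_iter=500)    # stable rank is reduced to 3
W_same, _ = srn_layer(W, r=10.0, u=u0, n_power_iter=500)  # target above srank(W): SN only
```

In training, one would typically run a single power-iteration step per update and carry `u` over between steps, as is common for SN; the large iteration count here only serves to make the check exact.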

## 4 Experiments

We now show experimental results using Algorithm 2 (SRN) on the generalization gap and the memorization of a NN on a standard classification task, and on the training of GANs (called SRN-GAN). Given a matrix $W \in \mathbb{R}^{m \times n}$, the desired stable rank $r$ is controlled using a single hyperparameter $c$ as $r = c \min(m, n)$, where $c \in (0, 1]$. We use the same $c$ for all the linear layers and show results using various values of $c$. It is trivial to note that if $c = 1$, or, for a given $c$, if $\mathrm{srank}(W) \le r$, then SRN boils down to SN (Miyato et al., 2018).

### 4.1 Generalization and Memorization Experiments

We perform (1) a simple classification task aimed at minimizing the negative log-likelihood (NLL) on CIFAR100 to see the effect of SRN on the generalization gap; and (2) in line with the shattering experiments in Zhang et al. (2016), we randomize the labels of CIFAR100 and CIFAR10 to show how learning low stable rank mappings helps in avoiding memorization as well. We use a DenseNet-40 model with 24 input channels in the first layer and a dropout of 0.2 applied after each convolution except the first one. The network is optimized using gradient descent with momentum, with the learning rate multiplied by 0.5 after every 25 epochs and no preprocessing on the dataset. We use stable rank constraints $c \in \{0.3, 0.5\}$ (denoted SRN-30 and SRN-50), and compare our method against standard training (Vanilla) and training with SN. The generalization gap is the difference between the empirical train and test losses, each being an empirical estimate of the loss function on the corresponding set. We show results using both the standard classification loss and the NLL loss. More details and additional experiments with varying learning rates and pre-processing are shown in Appendix D.

#### Generalization experiments

We begin with the standard classification task experiment. Figure 0(a) shows the effect of optimizing the stable rank on the train loss, test loss, and the generalization gap over epochs. It is evident that the test loss is almost the same for all the approaches; however, the generalization gap is consistently much lower for the model with low stable rank. Table 1 summarizes the results of the experiments. For the first of the two losses, one SRN setting consistently shows the best generalization gap, better than both Vanilla and SN, while maintaining an equally good test accuracy; the other SRN setting, while also showing a consistently better generalization gap than Vanilla and SN, additionally provides better test performance. For the second loss, both SRN-30 and SRN-50 consistently provide the best results. These experiments clearly suggest that SRN has an extremely desirable effect on the generalization gap without adversely affecting the capacity of the model to perform classification.

#### Memorization experiments

Here we look at the capacity of the network to shatter a dataset with randomly shuffled labels. This task can be learned only by memorizing the training dataset.

We show that stable rank constraints reduce memorization on random labels (and thus reduce the estimate of the Rademacher complexity; Zhang et al. (2016)). Figures 0(b) and 0(c) show the results for this setting on one of the two datasets, and the results on the other are presented in Figure 5 in Appendix D. It is evident that the SRN model fits the random training data the least. This can be interpreted as it having the least capacity to memorize the dataset, and thus the best generalization behaviour and the lowest model capacity. Note, as shown in the generalization experiments, the same model was able to achieve a high training accuracy when the labels were not randomized. Testing whether a hypothesis class can fit the training data well but not a randomized version of the data is a key test to gain empirical insights about the generalizability of the hypothesis class (Neyshabur et al., 2017), and the class of SRN models clearly exhibits superior performance in it. The above experiments clearly indicate that SRN, while providing enough capacity for the standard classification task, provides a much better generalization gap and is remarkably less prone to memorization compared to Vanilla and SN.
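The label-randomization step of such a memorization test can be sketched in a few lines. This is only an illustration: the helper names, the synthetic 10-class label vector, and the choice of permuting the labels (rather than resampling them) are assumptions, not details taken from the experimental setup.

```python
import numpy as np

def shuffle_labels(labels, seed=0):
    """Return a random permutation of `labels`, breaking any input-label relationship."""
    rng = np.random.default_rng(seed)
    return rng.permutation(labels)

def train_accuracy(predictions, labels):
    """Fraction of training points whose (possibly random) label is reproduced."""
    return float(np.mean(predictions == labels))

labels = np.repeat(np.arange(10), 100)       # hypothetical 10-class training labels
random_labels = shuffle_labels(labels)       # same label histogram, no relation to inputs
```

A model trained on `random_labels` can only raise `train_accuracy` by memorizing individual examples, so a lower training accuracy in this setting indicates less memorization.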

### 4.2 SRN for the training of Generative Adversarial Networks (SRN-GAN)

In GANs, there is a natural tension between the capacity and the generalizability of the discriminator. The capacity ensures that if the generated distribution and the data distribution are different, the discriminator is able to distinguish them. At the same time, the discriminator has to be generalizable, implying that the class of hypotheses should be small enough to ensure that it is not just memorizing the dataset. Based on these arguments, we use SRN in the discriminator of a GAN, which we call SRN-GAN, and compare it against SN-GAN (Miyato et al., 2018), WGAN-GP (Gulrajani et al., 2017), and orthonormal regularization based GAN (Ortho-GAN) (Brock et al., 2016). The CIFAR10, CIFAR100 (Krizhevsky and Hinton, 2009) and CelebA (Liu et al., 2015) datasets are used for these experiments. We show results on both conditional and unconditional GANs, for varying $c$. Please refer to Section E.1 for further details about the training setup.

#### Histogram of the Empirical Lipschitz Constant (eLhist)

Along with providing results using evaluation metrics such as Inception score (IS) (Salimans et al., 2016), FID (Heusel et al., 2017), and Neural divergence score (ND) (Gulrajani et al., 2019), we use histograms of the empirical Lipschitz constant, referred to as eLhist from now onwards, for the purpose of analysis. For a given trained GAN (unconditional), we create pairs of samples, where each pair consists of $x_i$ (randomly sampled from the ‘real’ dataset) and $x_j$ (randomly sampled from the generator). Each pair is then passed through the discriminator $f$ to compute $\frac{\|f(x_i) - f(x_j)\|_2}{\|x_i - x_j\|_2}$, which we then use to create the histogram. In the conditional setting, we first sample a class from a discrete uniform distribution over the classes, and then follow the same approach as described for the unconditional setting.
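The eLhist computation described above might be sketched as follows for the unconditional case. Everything here is a stand-in: `toy_discriminator` is a hypothetical linear map, `real` and `fake` are random arrays in place of dataset and generator samples, and `num_pairs` is a free parameter rather than the value used in the experiments.

```python
import numpy as np

def elhist_values(discriminator, real, fake, num_pairs, seed=0):
    """Ratios |D(x_i) - D(x_j)| / ||x_i - x_j||_2 for random (real, generated) pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(real), size=num_pairs)   # indices of real samples
    j = rng.integers(0, len(fake), size=num_pairs)   # indices of generated samples
    x_r, x_g = real[i], fake[j]
    num = np.abs(discriminator(x_r) - discriminator(x_g))
    den = np.linalg.norm((x_r - x_g).reshape(num_pairs, -1), axis=1)
    return num / den

rng = np.random.default_rng(0)
real = rng.standard_normal((64, 3, 8, 8))    # stand-in for real images
fake = rng.standard_normal((64, 3, 8, 8))    # stand-in for generator samples
w = rng.standard_normal(3 * 8 * 8)

def toy_discriminator(x):
    """Hypothetical scalar-output discriminator: a single linear map on flattened inputs."""
    return x.reshape(len(x), -1) @ w

ratios = elhist_values(toy_discriminator, real, fake, num_pairs=500)
counts, bin_edges = np.histogram(ratios, bins=20)    # the histogram that would be plotted
```

For this linear stand-in every ratio is bounded by $\|w\|_2$, so the histogram lies to the left of the Lipschitz upper-bound, mirroring how eLhist is read in the experiments.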

#### Effect of Stable Rank on eLhist and Inception Score

As shown in Figure 1(a), lowering the value of $c$ (aggressive reduction in the stable rank) moves the histogram towards zero, implying a lower empirical Lipschitz constant. This validates our arguments provided in Section 3. Lowering $c$ also improves the inception score; however, an extreme reduction in the stable rank dramatically collapses the histogram to zero and also drops the inception score significantly. This is due to the fact that, at such an extreme setting, the capacity of the discriminator is reduced to the point that it is not able to learn to differentiate between the real and the fake samples.

In Tables 2 and 4, we compare different approaches on standard metrics such as IS and FID. Stable Rank Normalization GAN (SRN-GAN) shows a consistently better FID score and an extremely competitive inception score on CIFAR10 (both conditional and unconditional settings) and CIFAR100 (unconditional setting). In Table 4, we also compare the ND loss on the CIFAR10 and CelebA datasets. The neural distance/divergence (ND) has been regarded as a metric more robust to memorization than FID and IS in recent works (Gulrajani et al., 2019; Arora and Zhang, 2017). We report our exact setting to compute ND in Section E.1. We essentially report the loss incurred by a fresh classifier trained to discriminate the generator distribution from the data distribution. Thus, the higher the loss, the better the generated images. As is evident, SRN-GAN has better ND scores on both datasets. For a qualitative analysis of the images, we show and compare generations from SRN-GAN, SN-GAN and other approaches in both conditional and unconditional settings on CIFAR10, CIFAR100 and CelebA in Appendix F.

#### Comparing different approaches

In addition, in Figure 1(b), we provide eLhist for comparing different approaches, as eLhist shows the data-dependent Lipschitzness. Random-GAN, as expected, has low empirical Lipschitzness and an extremely poor inception score. Interestingly, WGAN-GP provides an even lower $L_e$ than Random-GAN while providing a much higher inception score. On the other hand, the empirical Lipschitz constant of SRN-GAN is higher than that of Random-GAN and WGAN-GP, and lower than that of SN-GAN, while providing a better inception score. This indicates that SRN-GAN allows us to obtain a better trade-off between the capacity and the generalizability of the discriminator. It also supports our argument in Section 3 that adding SRN reduces the value of $L_e$. For the purpose of analysis, Figures 6(b) and 7(b) show eLhist for pairs where each sample comes either from the true data or from the generator; we observe a similar trend, and the magnitudes of eLhist are lower than in Figure 1(b). To verify that the same results hold in the conditional GAN setup, we show similar comparisons for GANs with a projection discriminator (Miyato and Koyama, 2018) in Figures 6, 6(a) and 7(a) and observe a similar trend. Further, to see the value of the local Lipschitzness in the vicinity of real and generated samples, we also plot the norm of the Jacobian in Figures 9 and 10 in Appendix E.2 and observe a mostly similar trend. In Section E.3 (Figure 11), we also show that the discriminator training of SRN-GAN is more stable than that of Spectral Normalization GAN (SN-GAN).

## 5 Conclusion

We propose a new normalization (SRN) that allows us to constrain the stable rank of each affine layer of a NN, which in turn learns a mapping with a low empirical Lipschitz constant. We also provide optimality guarantees for SRN. We show that SRN improves the generalization and memorization properties of a standard classifier by a very large margin. In addition, we show that SRN improves the training of GANs and provides better Inception, FID, and neural divergence scores.

## 6 Acknowledgements

AS acknowledges support from The Alan Turing Institute under the Turing Doctoral Studentship grant TU/C/000023. PHST and PKD were supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, EPSRC/MURI grant EP/N019474/1 and would also like to acknowledge the Royal Academy of Engineering and FiveAI.

## References

• Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009). Neural network learning: Theoretical foundations. Cambridge University Press.
• Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN.
• Arora and Zhang (2017) Arora, S. and Zhang, Y. (2017). Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224.
• Arora et al. (2018) Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 254–263, Stockholmsmässan, Stockholm Sweden. PMLR.
• Bartlett et al. (2017) Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
• Brock et al. (2016) Brock, A., Lim, T., Ritchie, J. M., and Weston, N. (2016). Neural photo editing with introspective adversarial networks. International Conference on Learning Representations.
• Cisse et al. (2017) Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. (2017). Parseval networks: Improving robustness to adversarial examples. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 854–863, International Convention Centre, Sydney, Australia. PMLR.
• Dumoulin et al. (2017) Dumoulin, V., Shlens, J., and Kudlur, M. (2017). A learned representation for artistic style. Proc. of ICLR.
• Eckart and Young (1936) Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218.
• Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
• Gouk et al. (2018) Gouk, H., Frank, E., Pfahringer, B., and Cree, M. (2018). Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368.
• Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of Wasserstein GANs.
• Gulrajani et al. (2019) Gulrajani, I., Raffel, C., and Metz, L. (2019). Towards GAN benchmarks which require generalization. In International Conference on Learning Representations.
• He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE.
• Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637.
• Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1), 1–42.
• Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Kodali et al. (2018) Kodali, N., Hays, J., Abernethy, J., and Kira, Z. (2018). On convergence and stability of GANs.
• Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.
• Lim and Ye (2017) Lim, J. H. and Ye, J. C. (2017). Geometric GAN.
• Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
• Mirsky (1960) Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. The quarterly journal of mathematics, 11(1), 50–59.
• Mirza and Osindero (2014) Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
• Mises and Pollaczek-Geiringer (1929) Mises, R. V. and Pollaczek-Geiringer, H. (1929). Praktische Verfahren der Gleichungsauflösung. ZAMM - Zeitschrift für Angewandte Mathematik und Mechanik, 9(2), 152–164.
• Miyato and Koyama (2018) Miyato, T. and Koyama, M. (2018). cGANs with projection discriminator. International Conference on Learning Representations.
• Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.
• Neyshabur et al. (2015) Neyshabur, B., Tomioka, R., and Srebro, N. (2015). Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401.
• Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., Mcallester, D., and Srebro, N. (2017). Exploring generalization in deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5947–5956. Curran Associates, Inc.
• Neyshabur et al. (2018) Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2018). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations.
• Novak et al. (2018) Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. (2018). Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760.
• Petzka et al. (2018) Petzka, H., Fischer, A., and Lukovnikov, D. (2018). On the regularization of Wasserstein GANs. In International Conference on Learning Representations.
• Ramachandran et al. (2018) Ramachandran, P., Zoph, B., and Le, Q. V. (2018). Searching for activation functions.
• Rudelson and Vershynin (2007) Rudelson, M. and Vershynin, R. (2007). Sampling from large matrices. Journal of the ACM, 54(4), 21–es.
• Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242.
• Tran et al. (2017) Tran, D., Ranganath, R., and Blei, D. M. (2017). Hierarchical implicit models and likelihood-free variational inference.
• Wielandt (1955) Wielandt, H. (1955). An extremum property of sums of eigenvalues. Proceedings of the American Mathematical Society, 6(1), 106–106.
• Yoshida and Miyato (2017) Yoshida, Y. and Miyato, T. (2017). Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941.
• Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR).
• Zhang et al. (2018a) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. (2018a). Self-attention generative adversarial networks.
• Zhang et al. (2018b) Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. (2018b). On the discrimination-generalization tradeoff in GANs. In International Conference on Learning Representations.

## Appendix A Technical Proofs

Here we provide an extensive proof of Theorem 3.1 (Section A.1), then give an example showing the difference between the solutions obtained using stable rank minimization and standard rank minimization (Section A.2), and finally provide the optimal solution to the spectral norm problem (Section A.3). Auxiliary lemmas on which our proofs depend are provided in Section A.4.

### A.1 Proof for Optimal Stable Rank Normalization (Main Theorem)

###### Proof.

Here we provide the proof of Theorem 3.1 (in the main paper) for all three cases, with optimality and uniqueness guarantees. Let $\hat{W}$ be the optimal solution to the problem for any of the two cases. From Lemma 6, the SVD of $W$ and $\hat{W}$ can be written as $W = U\Sigma V^\top$ and $\hat{W} = U\Lambda V^\top$, respectively. Then, $L = \|W - \hat{W}\|_F^2 = \langle \Sigma - \Lambda, \Sigma - \Lambda \rangle_F$. From now onwards, we denote $\Sigma$ and $\Lambda$ as vectors consisting of the diagonal entries, and $\langle \cdot, \cdot \rangle$ as the vector inner product. (Footnote 6: $\langle \cdot, \cdot \rangle_F$ represents the Frobenius inner product of two matrices, which in the case of diagonal matrices is the same as the inner product of the diagonal vectors.)

#### Proof for Case (a):

In this case, there is no constraint enforced to preserve any of the singular values of the given matrix while obtaining the new one. The only constraint is that the new matrix should have a stable rank of $r$. Let us write $\Sigma = (\sigma_1, \Sigma_2)$ and $\Lambda = (\lambda_1, \Lambda_2)$, separating the leading entry of each vector from the remaining ones. Using these notations, we can write $L$ as:

$$L = \langle \Sigma, \Sigma \rangle + \langle \Lambda, \Lambda \rangle - 2\langle \Sigma, \Lambda \rangle \tag{5}$$

Using the stable rank constraint, which is $\frac{\lambda_1^2 + \langle \Lambda_2, \Lambda_2 \rangle}{\lambda_1^2} = r$, we obtain the following equality constraint, which makes the problem non-convex:

$$\lambda_1^2 = \frac{\langle \Lambda_2, \Lambda_2 \rangle}{r - 1} \tag{6}$$

However, we will show that the solution we obtain is optimal and unique. Substituting Eq. 6 into Eq. 5 gives

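$$L = \langle \Sigma, \Sigma \rangle + \frac{\langle \Lambda_2, \Lambda_2 \rangle}{r - 1} + \langle \Lambda_2, \Lambda_2 \rangle - 2\sigma_1 \sqrt{\frac{\langle \Lambda_2, \Lambda_2 \rangle}{r - 1}} - 2\langle \Sigma_2, \Lambda_2 \rangle \tag{7}$$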

Setting $\partial L / \partial \Lambda_2 = 0$ gives the family of critical points

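$$\implies \Sigma_2 = \Lambda_2 \left( \frac{1}{r - 1} + 1 - \sigma_1 \frac{1}{\sqrt{(r - 1)\langle \Lambda_2, \Lambda_2 \rangle}} \right) \tag{8}$$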

The above equality implies that all the critical points of Eq. 7 are a scalar multiple of $\Sigma_2$, implying $\Lambda_2 = \gamma_2 \Sigma_2$. Substituting this into Eq. 8 we obtain

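$$\Sigma_2 = \gamma_2 \Sigma_2 \left( \frac{1}{r - 1} + 1 - \frac{\sigma_1}{\gamma_2 \sqrt{(r - 1)\langle \Sigma_2, \Sigma_2 \rangle}} \right)$$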

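Using the fact that $\langle \Sigma_2, \Sigma_2 \rangle = \|S_2\|_F^2$ in the above equality and with some algebraic manipulations, we obtain $\gamma_2 = \frac{\gamma + r - 1}{r}$ where $\gamma = \frac{\sqrt{r - 1}\,\sigma_1}{\|S_2\|_F}$. Note, , , and , implying, . The uniqueness of $\gamma_2$ is shown in Lemma 7. Using $\gamma_2$ and $\Lambda_2 = \gamma_2 \Sigma_2$ in Eq. 6, we obtain a unique solution $\lambda_1 = \frac{\gamma_2}{\gamma}\sigma_1$.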
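As an illustrative check of Case (a), the NumPy sketch below rescales the singular values of a small matrix. The scalings it uses, $\gamma = \sqrt{r-1}\,\sigma_1 / \|S_2\|_F$, $\gamma_2 = (\gamma + r - 1)/r$ and $\gamma_1 = \gamma_2/\gamma$, follow from solving Eq. 8 for the scalar multiple and substituting into Eq. 6. The function names are hypothetical, and the code assumes $1 < r \le$ the stable rank of the input; it is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np


def stable_rank(w):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(w, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)


def srn_case_a(w, r):
    """Hypothetical helper: rescale singular values so the stable rank equals r.

    Assumes 1 < r <= stable_rank(w) and at least two nonzero singular values.
    No singular value is preserved (Case (a)).
    """
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    s2_norm = np.linalg.norm(s[1:])            # ||S_2||_F
    gamma = np.sqrt(r - 1.0) * s[0] / s2_norm
    gamma2 = (gamma + r - 1.0) / r             # scale for the tail singular values
    gamma1 = gamma2 / gamma                    # scale for the top singular value
    s_new = np.concatenate(([gamma1 * s[0]], gamma2 * s[1:]))
    return (u * s_new) @ vt                    # U diag(s_new) V^T


w = np.diag([3.0, 2.0, 1.0])                   # stable rank 14/9
w_hat = srn_case_a(w, r=1.2)
print(stable_rank(w), stable_rank(w_hat))
```

Note that the top singular value is scaled up by $\gamma_1$ while the remaining ones are scaled down by $\gamma_2$, so the spectral norm of the result can exceed that of the input.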

#### Proof for Case (b):

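In this case, the constraints are meant to preserve the top $k$ singular values of the given matrix while obtaining the new one. Let $\Sigma = (\Sigma_1, \Sigma_2)$ and $\Lambda = (\Lambda_1, \Lambda_2)$, where the first block holds the top $k$ entries. Since satisfying all the constraints implies $\Lambda_1 = \Sigma_1$, thus, $L = \langle \Sigma_2 - \Lambda_2, \Sigma_2 - \Lambda_2 \rangle$. From the stable rank constraint, we have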

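$$\begin{aligned} r &= \frac{\langle \Lambda_1, \Lambda_1 \rangle + \langle \Lambda_2, \Lambda_2 \rangle}{\lambda_1^2} \\ \therefore \langle \Lambda_2, \Lambda_2 \rangle &= r\lambda_1^2 - \langle \Lambda_1, \Lambda_1 \rangle \end{aligned} \tag{9}$$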

The above equality constraint makes the problem non-convex. Thus, we relax it to an inequality constraint to make it a convex problem and show that the optimality is achieved with equality. Let $\eta = r\sigma_1^2 - \|S_1\|_F^2$. Then, the relaxed problem can be written as

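$$\min_{\Lambda_2 \in \mathbb{R}^{p-k}} L := \langle \Sigma_2 - \Lambda_2, \Sigma_2 - \Lambda_2 \rangle \quad \text{s.t.} \quad \Lambda_2 \ge 0, \; \langle \Lambda_2, \Lambda_2 \rangle \le \eta.$$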

We introduce the Lagrangian dual variables $\Gamma$ and $\mu$ corresponding to the positivity and the stable rank constraints, respectively. The Lagrangian can then be written as

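$$\underset{\Gamma \ge 0,\, \mu \ge 0}{L(\Lambda_2, \Gamma, \mu)} = \langle \Sigma_2 - \Lambda_2, \Sigma_2 - \Lambda_2 \rangle + \mu \left( \langle \Lambda_2, \Lambda_2 \rangle - \eta \right) - \langle \Gamma, \Lambda_2 \rangle \tag{10}$$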

Using the primal optimality condition $\partial L / \partial \Lambda_2 = 0$, we obtain

$$2\Lambda_2 - 2\Sigma_2 + 2\mu\Lambda_2 - \Gamma = 0 \implies \Lambda_2 = \frac{\Gamma + 2\Sigma_2}{2(1 + \mu)} \tag{11}$$

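Using the above condition on $\Lambda_2$ with the constraint $\langle \Lambda_2, \Lambda_2 \rangle \le \eta$, combined with the stable rank constraint of the given matrix that comes with the problem definition (which implies $\langle \Sigma_2, \Sigma_2 \rangle > \eta$), the following inequality must be satisfied for any $\Gamma \ge 0$: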

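$$1 < \frac{\langle \Sigma_2, \Sigma_2 \rangle}{\eta} \le \frac{\langle \Gamma + 2\Sigma_2, \Gamma + 2\Sigma_2 \rangle}{4\eta} \le (1 + \mu)^2 \tag{12}$$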

For the above inequality to be satisfied, the dual variable $\mu$ must be greater than zero, implying that $\langle \Lambda_2, \Lambda_2 \rangle - \eta$ must be zero for complementary slackness to hold. Using this with the optimality condition Eq. 11 we obtain

$$(1 + \mu)^2 = \frac{\langle \Gamma + 2\Sigma_2, \Gamma + 2\Sigma_2 \rangle}{4\eta}$$

Substituting the above solution back into the primal optimality condition we get

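$$\Lambda_2 = \frac{\sqrt{\eta}\,(\Gamma + 2\Sigma_2)}{\sqrt{\langle \Gamma + 2\Sigma_2, \Gamma + 2\Sigma_2 \rangle}} \tag{13}$$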

Finally, we use the complementary slackness condition $\Gamma \odot \Lambda_2 = 0$ (footnote 7: $\odot$ is the Hadamard product) to get rid of the dual variable $\Gamma$ as follows

$$\Gamma \odot \frac{(\Gamma + 2\Sigma_2)\sqrt{\eta}}{\sqrt{\langle \Gamma + 2\Sigma_2, \Gamma + 2\Sigma_2 \rangle}} = 0$$

It is easy to see that the above condition is satisfied only when $\Gamma = 0$, as $\Gamma \ge 0$ and $\Sigma_2 \ge 0$. Therefore, using $\Gamma = 0$ in Eq. 13 we obtain the optimal solution of $\Lambda_2$ as

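$$\Lambda_2 = \frac{\sqrt{\eta}}{\sqrt{\langle \Sigma_2, \Sigma_2 \rangle}} \Sigma_2 = \sqrt{\frac{r\sigma_1^2 - \|S_1\|_F^2}{\|S_2\|_F^2}}\, \Sigma_2 = \gamma \Sigma_2 \tag{14}$$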
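The closed form in Eq. 14 can be sketched in a few lines of NumPy: keep the top $k$ singular values and multiply the remaining ones by $\gamma$. The helper name `srn_case_b` and the error handling are assumptions made for illustration, and the code presumes the target $r$ does not exceed the stable rank of the input.

```python
import numpy as np


def srn_case_b(w, r, k=1):
    """Hypothetical helper: keep the top-k singular values, scale the rest by gamma (Eq. 14)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    eta = r * s[0] ** 2 - np.sum(s[:k] ** 2)   # eta = r * sigma_1^2 - ||S_1||_F^2
    tail_sq = np.sum(s[k:] ** 2)               # ||S_2||_F^2
    if eta < 0 or tail_sq == 0:
        raise ValueError("target stable rank is not reachable for this k")
    gamma = np.sqrt(eta / tail_sq)
    s_new = np.concatenate((s[:k], gamma * s[k:]))
    return (u * s_new) @ vt                    # U diag(s_new) V^T


w = np.diag([3.0, 2.0, 1.0])
w_hat = srn_case_b(w, r=1.2, k=1)
s_hat = np.linalg.svd(w_hat, compute_uv=False)
print(s_hat, np.sum(s_hat ** 2) / s_hat[0] ** 2)
```

In this small example $\gamma = 0.6$, so the spectral norm is unchanged and only the tail of the spectrum shrinks.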

#### Proof for Case (c):

The monotonicity of $\|W - \hat{W}_k\|_F$ for $k \ge 1$ is shown in Lemma 4. ∎

Note that by the assumption that , we can say that . Therefore in all the cases . Let us look at the required conditions for to hold. When , holds. When , for to be true, should hold, implying, , which is always true as (by the definition of stable rank).

###### Lemma 4.

For $k \ge 1$, the solution to the optimization problem Eq. 4 obtained using Theorem 3.1 is closest to the original matrix in terms of Frobenius norm when only the spectral norm is preserved, implying $k = 1$.

###### Proof.

For a given matrix $W$ and a partitioning index $k$, let $\hat{W}_k$ be the matrix obtained using Theorem 3.1. We use the superscript $k$ along with $S_1$ and $S_2$ to denote that this refers to the particular solution $\hat{W}_k$. Plugging in the value of $\gamma$ and using the fact that $\|W\|_F^2 = \|S_1^k\|_F^2 + \|S_2^k\|_F^2$, we can write

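$$\begin{aligned} \|W - \hat{W}_k\|_F &= \|S_2^k\|_F - \sqrt{r\sigma_1^2 - \|S_1^k\|_F^2} \\ &= \|S_2^k\|_F - \sqrt{r\sigma_1^2 - \|W\|_F^2 + \|S_2^k\|_F^2}. \end{aligned}$$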

Thus, can be written in a simplified form as , where and . Note, as , and because of the condition in Section 3.1. Under these settings, it is trivial to verify that is a monotonically decreasing function of . Using the fact that as the partition index increases, decreases, it is straightforward to conclude that the minimum of is obtained at . ∎
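The claim of Lemma 4 can be checked numerically with the distance expression displayed in its proof. The sketch below evaluates that expression for every feasible partition index on an illustrative spectrum; the helper name and the chosen singular values are assumptions, not values from the paper.

```python
import numpy as np


def srn_distance(singular_values, r, k):
    """Distance ||S_2^k||_F - sqrt(r * sigma_1^2 - ||S_1^k||_F^2) for partition index k."""
    s = np.asarray(singular_values, dtype=float)
    eta = r * s[0] ** 2 - np.sum(s[:k] ** 2)
    if eta < 0:
        return float("nan")                    # this k cannot reach stable rank r
    return float(np.linalg.norm(s[k:]) - np.sqrt(eta))


spectrum = [2.0, 1.5, 1.2, 1.0, 0.8]           # illustrative values, stable rank about 2.33
distances = [srn_distance(spectrum, r=2.2, k=k) for k in range(1, 5)]
print(distances)
```

For this spectrum the distance grows with $k$, so the smallest value is attained at $k = 1$, in line with the lemma.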

### A.2 Example

Let us assume $W$ is a $3 \times 3$ identity matrix (rank = stable rank = 3) and the objective is to obtain a new matrix with a stable rank of $2$. We consider three cases: (a) $\hat{W}_1$ as the solution to rank minimization without the stable rank constraint (Eckart-Young-Mirsky (Eckart and Young, 1936)); (b) $\hat{W}_2$ as the solution of Theorem 3.1 with $k = 1$; and (c) $\hat{W}_3$ as the solution of Theorem 3.1 with $k = 0$. The solutions to these three cases can be computed as (use Theorem 3.1 for cases (b) and (c)):

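$$\hat{W}_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \hat{W}_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{2}} \end{bmatrix}, \quad \hat{W}_3 = \begin{bmatrix} \frac{\sqrt{2}+1}{2} & 0 & 0 \\ 0 & \frac{\sqrt{2}+1}{2\sqrt{2}} & 0 \\ 0 & 0 & \frac{\sqrt{2}+1}{2\sqrt{2}} \end{bmatrix}$$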

It is easy to verify that the stable rank of all the above solutions is $2$. However, the Frobenius distance of these solutions from the original matrix follows the order $\|W - \hat{W}_1\|_F > \|W - \hat{W}_2\|_F > \|W - \hat{W}_3\|_F$. This example shows that the solution provided in Theorem 3.1, instead of completely removing a particular singular value, scales them (depending on ) such that the new matrix has the desired stable rank and is closest to the original matrix in terms of Frobenius norm. Interestingly, as shown in the example, in the case of $k = 0$, the spectral norm of the optimal solution is greater than that of the original matrix.
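The three matrices of this example can be checked numerically. The following sketch builds them as diagonal matrices, assuming the target stable rank is 2 (which the listed entries are consistent with), and computes the stable rank and Frobenius distance of each; the variable and function names are illustrative.

```python
import numpy as np


def stable_rank(w):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(w, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)


w = np.eye(3)
a = (np.sqrt(2) + 1) / 2                       # top entry of the third solution
b = (np.sqrt(2) + 1) / (2 * np.sqrt(2))        # remaining entries of the third solution

w1 = np.diag([1.0, 1.0, 0.0])                  # rank truncation (Eckart-Young-Mirsky)
w2 = np.diag([1.0, 1 / np.sqrt(2), 1 / np.sqrt(2)])
w3 = np.diag([a, b, b])

stable_ranks = [stable_rank(m) for m in (w1, w2, w3)]
distances = [float(np.linalg.norm(w - m)) for m in (w1, w2, w3)]
print(stable_ranks)                            # each is 2 up to rounding
print(distances)                               # roughly 1.0, 0.414, 0.293
```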

### A.3 Proof for Optimal Spectral Normalization

The widely used spectral normalization (Miyato et al., 2018), where the given matrix is divided by its maximum singular value, is an approximation to the optimal solution of the spectral normalization problem defined as

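$$\begin{aligned} \operatorname*{arg\,min}_{\hat{W}} \quad & \|W - \hat{W}\|_F^2 \\ \text{s.t.} \quad & \sigma(\hat{W}) \le s, \end{aligned} \tag{15}$$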

where $\sigma(\hat{W})$ denotes the maximum singular value and $s$ is a hyperparameter. The optimal solution to this problem is shown in Algorithm 3.

In what follows we provide the optimality proof of Algorithm 3 for the sake of completeness. Let and let us assume that is a solution to problem (15). Trivially, also satisfies . Now, , where the last inequality directly comes from Lemma 5. Thus the singular vectors of the optimal solution must be the same as those of $W$. This boils down to solving the following problem

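$$\operatorname*{arg\,min}_{\Lambda \in \mathbb{R}_+^{\min(m,n)}} \|\Lambda - \Sigma\|_F^2 \quad \text{s.t.} \quad \Lambda[i] \le s \enskip \forall i \in \{0, \min(m,n)\}. \tag{16}$$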

Here, without loss of generality, we abuse notation by considering $\Lambda$ and $\Sigma$ to represent the diagonal vectors of the original diagonal matrices $\Lambda$ and $\Sigma$, and $\Lambda[i]$ as its $i$-th index. It is trivial to see that the optimal solution with minimum Frobenius norm is achieved when

$$\Lambda[i] = \begin{cases} \Sigma[i], & \text{if } \Sigma[i] \le s \\ s, & \text{otherwise.} \end{cases}$$

This is exactly what Algorithm 3 implements.
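Algorithm 3 itself is not reproduced in this appendix, so the following is only a minimal NumPy sketch of the clipping rule above, compared against the common approximation of dividing by the largest singular value. The function names are hypothetical.

```python
import numpy as np


def clip_spectral_norm(w, s_max):
    """Closest matrix in Frobenius norm with spectral norm at most s_max (singular value clipping)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u * np.minimum(s, s_max)) @ vt     # clip each singular value at s_max


def divide_by_spectral_norm(w, s_max):
    """Common approximation: rescale the whole matrix so its spectral norm equals s_max."""
    sigma_max = np.linalg.svd(w, compute_uv=False)[0]
    return w * (s_max / sigma_max)


w = np.diag([3.0, 2.0, 1.0])
clipped = clip_spectral_norm(w, s_max=1.0)
scaled = divide_by_spectral_norm(w, s_max=1.0)
print(np.linalg.norm(w - clipped), np.linalg.norm(w - scaled))
```

Both outputs satisfy the spectral norm constraint, but clipping leaves the singular values that are already below `s_max` untouched, which is why it stays closer to the input.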

### A.4 Auxiliary Lemmas

###### Lemma 5.

[Reproduced from Theorem 5 in Mirsky (1960)] For any two matrices $A$ and $B$ with singular values $\sigma_1 \ge \dots \ge \sigma_n$ and $\rho_1 \ge \dots \ge \rho_n$, respectively,

$$\|A - B\|_F^2 \ge \sum_{i=1}^{n} (\sigma_i - \rho_i)^2$$

###### Proof.

Consider the following symmetric matrices

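$$X = \begin{bmatrix} 0 & A \\ A^\top & 0 \end{bmatrix}, \quad Y = \begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}, \quad Z = \begin{bmatrix} 0 & A - B \\ (A - B)^\top & 0 \end{bmatrix}$$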

Let