Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

Paul Hand and Vladislav Voroninski
Both authors contributed equally.
Department of Computational and Applied Mathematics, Rice University, Houston, TX
Helm.ai, Menlo Park, CA
Abstract

We examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization. In particular, we consider two models: one in which the task is to invert a generative neural network given access to its last layer, and another which entails recovering a latent code in the domain of a generative neural network from compressive linear observations of its last layer. We establish that in both cases, under suitable regimes of network layer sizes and a randomness assumption on the network weights, the non-convex objective function given by empirical risk minimization has no spurious stationary points. That is, we establish that, with high probability, at any point away from small neighborhoods around two scalar multiples of the desired solution, there is a descent direction. These results constitute the first theoretical guarantees establishing the favorable global geometry of these non-convex optimization problems, and they bridge the gap between the empirical success of deep learning and a rigorous understanding of non-linear inverse problems.

1 Introduction

Exploiting the structure of natural signals and images has proven to be a fruitful endeavor across many domains of science. Breaking with the dogma of the Nyquist sampling theorem, which stems from worst-case analysis, Candes and Tao, and Donoho [8, 10], provided a theory and practice of compressed sensing (CS), which exploits the sparsity of natural signals to design acquisition strategies whose sample complexity is on par with the sparsity level of the signal at hand. On a practical level, compressed sensing has led to significant reductions in the sample complexity of signal acquisition of natural images, for instance speeding up MRI by a factor of 10. Beyond MRI, compressed sensing has impacted many if not all imaging sciences, by providing a general tool to exploit the parsimony of natural signals to improve acquisition speed, increase SNR, and reduce sample complexity. CS has also led to the development of the fields of matrix completion [7], phase retrieval [9, 6], and several other subfields [1], which analogously exploit the sparsity of the singular values of low-rank matrices or sparsity in some other basis.

Meanwhile, the advent of practical deep learning has significantly improved meaningful compression of images and acoustic signals. For instance, deep learning techniques are now the state of the art across most of computer vision, and have taken the field far beyond where it stood just a few years prior. The success of deep learning ostensibly stems from its ability to exploit the hierarchical nature of images and other signals. There are many techniques and add-on architectural choices associated with deep learning, but many of them are non-essential from a theoretical and, to an extent, practical perspective, with simple convolutional deep nets with ReLUs achieving close to state-of-the-art performance on many tasks [15]. The class of functions represented by such deep networks is readily interpretable as hierarchical compression schemes with exponentially many linear filters, each being a linear combination of filters in earlier layers. Constructing such compression schemes by hand would be quite tedious if not impossible, and the biggest surprise of deep learning is that simple stochastic gradient descent (SGD) allows one to efficiently traverse this class of functions subject to highly non-convex learning objectives. While this latter property has been empirically established in an impressive number of applications, it has so far eluded a completely satisfactory theoretical explanation.

Optimizing over the weights of a neural network or inverting a neural network may both be interpreted as inverse problems [13]. Traditionally, a rigorous understanding of inverse problems has been limited to the simpler setting in which the optimization objective is convex. More recently, there has been progress in understanding non-convex optimization objectives for inverse problems, albeit in analytically simpler settings than those involving multilayer neural networks. For instance, the authors of [16, 4] provide a global analysis of non-convex objectives for phase retrieval and community detection, respectively, ruling out adversarial geometries in these scenarios for the purposes of optimization.

Very recently, deep neural networks have been exploited to construct highly effective natural image priors, by training generative adversarial networks to find a Nash equilibrium of a non-convex game [11]. The resulting image priors have proven useful in inverting hidden layers of lossy neural networks [14] and in performing super-resolution [12]. Naturally, one may ask whether these generative priors can be leveraged to improve compressive sensing. Indeed, while natural images are sparse in the wavelet basis, a random sparse linear combination of wavelets is far less structured than, say, a real-world scene or a biological structure, illustrating that a generic sparsity prior is nowhere near tight. The generative priors provided by GANs have already been leveraged to improve compressed sensing in particular domains [3]. Remarkably, the empirical results of [3] show that, given a dataset of images from a particular class, one can perform compressed sensing with 10× fewer measurements than what a sparsity prior alone would permit in traditional CS. As GANs and other neural-network-based priors improve at modeling more diverse datasets of images, many scenarios in compressed sensing will benefit analogously. Moreover, using generative priors to improve signal recovery in otherwise underdetermined settings is not limited to linear inverse problems, and in principle these benefits should carry over to any inverse problem in imaging science.

In this paper we present the first global analysis of empirical risk minimization for enforcing generative multilayer neural network priors. In particular, we show that, under suitable randomness assumptions on the weights of a neural network and successively expansive hidden layer sizes, the empirical risk objective for recovering a latent code in $\mathbb{R}^k$ from $m$ linear observations of the last layer of a generative network, where $m$ is proportional to $k$ up to log factors, has no spurious local minima, in the sense that there is a descent direction everywhere except possibly in small neighborhoods around two scalar multiples of the desired solution. Our descent direction analysis is constructive and relies on novel concentration bounds for certain random matrices, uncovering interesting geometrical properties of the landscapes of empirical risk objective functions for random ReLU'd generative multilayer networks. The tools developed in this paper may be of independent interest, and may in particular lead to global non-asymptotic guarantees regarding convergence of SGD for training deep neural networks.

2 Related theoretical work

In related work, the authors of [3] also rigorously study the inversion of compressive linear observations under generative priors, by proving a restricted eigenvalue condition on the range of the generative neural network. However, they only provide a guarantee that is local in nature, showing that the global optimum of the empirical risk is close to the desired solution. The global properties of optimizing the objective function are not theoretically studied in [3]. In addition, [2] studied inverting neural networks given access to the last layer, using an analytical formula that approximates the inverse mapping of a neural network. The results of [2] are in a setting where the neural net is not generative; their procedure is at best approximate; and, since it requires observation of the full last layer, it is not readily extendable to the compressive linear observation setting. Meanwhile, the optimization problem we study can yield exact recovery, which we observe empirically via gradient descent. Most importantly, in contrast to [3, 2], we provide a global analysis of the non-convex empirical risk objective function and constructively exhibit a descent direction at every point outside a neighborhood of the desired solution and a negative scalar multiple of it. Our guarantees are non-asymptotic and, to the best of our knowledge, the first of their kind.

3 Main Results

3.1 Generative priors with two layers

We consider the nonlinear inverse problem of recovering a signal $x_* \in \mathbb{R}^k$ from linear measurements of the last layer of a generative neural network with two layers, Gaussian weights, and a ReLU activation function at each layer. Let $k$ be the dimensionality of the latent code space. Let $n_1$ and $n_2$ be the number of neurons in the first and second layers, respectively. Let $W_1 \in \mathbb{R}^{n_1 \times k}$ and $W_2 \in \mathbb{R}^{n_2 \times n_1}$ be the weights of the neural network at the first and second layers. Let $A \in \mathbb{R}^{m \times n_2}$ be a matrix that linearly subsamples the output of the generative network. Here, we consider the expansive regime $k < n_1 < n_2$ and the compressive regime $m < n_2$. The problem at hand is:

Let: $y = A\,\mathrm{relu}\big(W_2\,\mathrm{relu}(W_1 x_*)\big)$ for some $x_* \in \mathbb{R}^k$.
Given: $A$, $W_1$, $W_2$, and $y$.
Find: $x_*$.
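To make the measurement model concrete, the following sketch instantiates a random two-layer ReLU generator and compressive Gaussian observations. The dimensions, variable names (`W1`, `W2`, `A`, `x_star`), and the $1/n_1$, $1/n_2$, $1/m$ variance normalizations are illustrative assumptions chosen to mirror the setup above, not values prescribed by the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: latent dimension k, expansive widths n1 < n2, m compressive measurements.
k, n1, n2, m = 10, 250, 600, 60

# Gaussian weights and measurement matrix (the 1/n1, 1/n2, 1/m scalings are an illustrative choice).
W1 = rng.normal(0.0, 1.0 / np.sqrt(n1), size=(n1, k))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n2), size=(n2, n1))
A  = rng.normal(0.0, 1.0 / np.sqrt(m),  size=(m, n2))

def relu(z):
    return np.maximum(z, 0.0)

def G(x):
    # Two-layer ReLU generator: G(x) = relu(W2 relu(W1 x)).
    return relu(W2 @ relu(W1 @ x))

x_star = rng.normal(size=k)   # the unknown latent code
y = A @ G(x_star)             # compressive linear observations of the last layer
```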

We consider solving this signal recovery problem by minimizing the empirical risk under an $\ell_2$ loss. That is, we study the minimization of

$$f(x) = \frac{1}{2}\,\Big\|\, A\,\mathrm{relu}\big(W_2\,\mathrm{relu}(W_1 x)\big) - y \,\Big\|_2^2. \qquad (1)$$

Note that $f$ is not differentiable, but its (one-sided) directional derivatives exist at all points and in all directions. Let $D_v f(x)$ be the (one-sided) directional derivative of $f$ at $x$ in the direction $v$. That is, let $D_v f(x) = \lim_{t \to 0^+} \frac{f(x + t v) - f(x)}{t}$.

We show that if the neural network weights and sampling matrix are Gaussian and independent from each other, then the empirical risk function has a strict descent direction at any non-zero point outside of a neighborhood of $x_*$ and a neighborhood of a negative scalar multiple of $x_*$, with high probability, provided that $n_1$ is sufficiently large relative to $k$, $n_2$ is sufficiently large relative to $n_1$, and $m$ is on the order of $k$ up to log factors. Specifically, we show that a descent direction is given by $v_{x,x_*}$, where

$$v_{x,x_*} = \big(A\,W_{2,+,x}\,W_{1,+,x}\big)^t\big(A\,W_{2,+,x}\,W_{1,+,x}\,x \;-\; A\,W_{2,+,x_*}\,W_{1,+,x_*}\,x_*\big), \qquad (2)$$

with $W_{1,+,x} := \mathrm{diag}(W_1 x > 0)\,W_1$ and $W_{2,+,x} := \mathrm{diag}(W_2 W_{1,+,x}\, x > 0)\,W_2$, where we take $\mathrm{diag}(w)$ to be a diagonal matrix with the entries of a vector $w$ on the diagonal. This expression is the gradient of $f$ where $f$ is differentiable. Where $f$ is not differentiable, this expression is the gradient of the terms in $f$ where all arguments to the ReLU functions are strictly positive. Note that it can be computed at an arbitrary $x$ knowing only $W_1$, $W_2$, $A$, and the observations $y$, since $A\,W_{2,+,x_*}\,W_{1,+,x_*}\,x_* = y$. Additionally, we show that $x = 0$ is a local max by establishing that the directional derivative of $f$ at zero is negative in all directions.
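The formula (2) translates directly into a few lines of code. The sketch below (reusing the same illustrative dimensions and names as before, which are assumptions rather than choices made in the paper) forms $W_{1,+,x}$ and $W_{2,+,x}$ by zeroing the inactive rows and then evaluates $v_{x,x_*}$ using only $W_1$, $W_2$, $A$, and $y$.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n1, n2, m = 10, 250, 600, 60                      # illustrative sizes
W1 = rng.normal(0.0, 1.0 / np.sqrt(n1), (n1, k))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n2), (n2, n1))
A  = rng.normal(0.0, 1.0 / np.sqrt(m),  (m, n2))
relu = lambda z: np.maximum(z, 0.0)

x_star = rng.normal(size=k)
y = A @ relu(W2 @ relu(W1 @ x_star))                 # observations A G(x_star)

def descent_direction(x):
    """v_{x,x_*} as in (2), computable from W1, W2, A, and y alone."""
    W1p = (W1 @ x > 0)[:, None] * W1                 # W_{1,+,x} = diag(W1 x > 0) W1
    W2p = (W2 @ (W1p @ x) > 0)[:, None] * W2         # W_{2,+,x} = diag(W2 W_{1,+,x} x > 0) W2
    J = A @ W2p @ W1p                                # A W_{2,+,x} W_{1,+,x}
    return J.T @ (J @ x - y)                         # uses A W_{2,+,x*} W_{1,+,x*} x_* = y

v = descent_direction(rng.normal(size=k))            # evaluate at an arbitrary nonzero point
```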

Theorem 1.

Fix $\epsilon > 0$. Let $W_1 \in \mathbb{R}^{n_1 \times k}$, $W_2 \in \mathbb{R}^{n_2 \times n_1}$, and $A \in \mathbb{R}^{m \times n_2}$ have i.i.d. Gaussian entries with mean 0 and variances $1/n_1$, $1/n_2$, and $1/m$, respectively. If $n_1 \geq C_\epsilon k \log k$, $n_2 \geq C_\epsilon n_1 \log n_1$, and $m \geq C_\epsilon k \log(n_1 n_2)$, then with probability at least $1 - c\, n_1 e^{-\gamma_\epsilon k} - c\, n_2 e^{-\gamma_\epsilon n_1} - c\, e^{-\gamma_\epsilon m}$, we have
$$D_{-v_{x,x_*}} f(x) < 0 \quad \text{for all nonzero } x \notin \mathcal{B}(x_*, \epsilon \|x_*\|_2) \cup \mathcal{B}(-\rho\, x_*, \epsilon \|x_*\|_2), \ \text{for some } \rho > 0,$$

simultaneously for all nonzero $x_* \in \mathbb{R}^k$. Here, $C_\epsilon$ and $\gamma_\epsilon$ are constants that depend only on $\epsilon$, and $c$ is a universal constant.

In particular, by taking the layer sizes large enough and sufficiently successively expansive, with a number of observations proportional to $k$ up to log factors, we can guarantee strict descent directions of the empirical risk objective everywhere but two arbitrarily small neighborhoods of two scalar multiples of $x_*$.

The case of inversion of a two-layer network follows by taking the matrix $A$ to be the identity, or alternatively by taking $A$ to be Gaussian with $m$ comparable to $n_2$.

Corollary 2.

(Inversion of a Two-Layer Network) Under the same assumptions as Theorem 1, but with $A = I_{n_2}$ and $m = n_2$, we have the same result if $n_1 \geq C_\epsilon k \log k$ and $n_2 \geq C_\epsilon n_1 \log n_1$, with probability at least $1 - c\, n_1 e^{-\gamma_\epsilon k} - c\, n_2 e^{-\gamma_\epsilon n_1}$.

The proofs of this theorem and corollary are in Section 6.3.

Note that the theorem assumes each layer has at least a logarithmic factor more neurons than the layer before it. For the probability bounds to be nontrivial, each layer cannot have more than exponentially many more neurons than the prior layer. This theorem establishes that if an algorithm for minimizing (1) finds a stationary point, then that point is near $x_*$ or near a negative multiple of $x_*$ with high probability, provided that there are a sufficient number of neurons at each layer and provided that the number of linear observations of the last layer is proportional to $k$ up to log factors. Note that the number of required random linear observations of the last layer is proportional, up to log factors, to the dimension of the latent code space, which may be significantly smaller than the embedding space of the last layer. Thus our guarantees hold in the regime of genuine compressive sensing under a generative prior.
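The theorem is a statement about the optimization landscape rather than about a specific algorithm, but a simple first-order scheme that repeatedly steps along $-v_{x,x_*}$ illustrates how such a landscape can be exploited. The step size, iteration count, initialization, and dimensions below are illustrative assumptions, not choices analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n1, n2, m = 10, 250, 600, 60                      # illustrative sizes
W1 = rng.normal(0.0, 1.0 / np.sqrt(n1), (n1, k))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n2), (n2, n1))
A  = rng.normal(0.0, 1.0 / np.sqrt(m),  (m, n2))
relu = lambda z: np.maximum(z, 0.0)
G = lambda x: relu(W2 @ relu(W1 @ x))

x_star = rng.normal(size=k)
y = A @ G(x_star)

def v(x):
    W1p = (W1 @ x > 0)[:, None] * W1
    W2p = (W2 @ (W1p @ x) > 0)[:, None] * W2
    J = A @ W2p @ W1p
    return J.T @ (J @ x - y)

x = rng.normal(size=k)                               # random nonzero initialization
for _ in range(2000):
    x = x - 0.5 * v(x)                               # step along the descent direction -v

# Typically ends up near x_star for sizes like these; it may instead approach
# a negative multiple of x_star, consistent with the landscape described above.
print(np.linalg.norm(x - x_star) / np.linalg.norm(x_star))
```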

3.2 Generative priors with $d$ layers

We also consider the nonlinear inverse problem of recovering a signal $x_* \in \mathbb{R}^k$ from linear measurements of the last layer of a random generative neural network with $d$ layers, with ReLU activations at each layer. Let $k$ be the dimensionality of the latent code space. Let $n_i$ be the number of neurons in the $i$th layer. Let $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ (with $n_0 = k$) be the weights of the neural network at the $i$th layer. Let $A \in \mathbb{R}^{m \times n_d}$ be a matrix that linearly subsamples the output of the generative network. Here, we consider the expansive regime $k < n_1 < \cdots < n_d$ and the compressive regime $m < n_d$. The problem at hand is:

Let: $y = A\,\mathrm{relu}\big(W_d \cdots \mathrm{relu}(W_2\,\mathrm{relu}(W_1 x_*)) \cdots\big)$ for some $x_* \in \mathbb{R}^k$.
Given: $A$, $W_1, \ldots, W_d$, and $y$.
Find: $x_*$.

We study the minimization of
$$f(x) = \frac{1}{2}\,\Big\|\, A\,\mathrm{relu}\big(W_d \cdots \mathrm{relu}(W_2\,\mathrm{relu}(W_1 x)) \cdots\big) - y \,\Big\|_2^2.$$

We show that if the neural network weights and the entries of the sampling matrix are independent Gaussians, then the empirical risk function has a strict descent direction at any point outside of a neighborhood of $x_*$ and a neighborhood of a negative multiple of $x_*$, with high probability, provided that each layer is sufficiently expansive relative to the previous one and that $m$ is on the order of $k$ up to log factors. Specifically, we show that a descent direction is given by

$$v_{x,x_*} = \big(A\,W_{d,+,x} \cdots W_{1,+,x}\big)^t\big(A\,W_{d,+,x} \cdots W_{1,+,x}\,x \;-\; y\big), \qquad (3)$$

where $W_{1,+,x} = \mathrm{diag}(W_1 x > 0)\,W_1$, $W_{i,+,x} = \mathrm{diag}\big(W_i\, W_{i-1,+,x} \cdots W_{1,+,x}\, x > 0\big)\,W_i$ for $i = 2, \ldots, d$, and $y = A\,\mathrm{relu}\big(W_d \cdots \mathrm{relu}(W_1 x_*) \cdots\big)$.
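For intuition, the reconstructed formula (3) can be evaluated with the same row-zeroing computation applied layer by layer. The sketch below is a hypothetical illustration: the number of layers, widths, and $1/n_i$ scalings are assumptions, and `descent_direction` mirrors the two-layer computation above rather than reproducing any code from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
k, m = 10, 60
widths = [60, 180, 500]                              # n_1 < n_2 < n_3, illustrative
dims = [k] + widths
Ws = [rng.normal(0.0, 1.0 / np.sqrt(n_out), (n_out, n_in))
      for n_in, n_out in zip(dims[:-1], dims[1:])]
A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, widths[-1]))
relu = lambda z: np.maximum(z, 0.0)

def G(x):
    for W in Ws:
        x = relu(W @ x)
    return x

x_star = rng.normal(size=k)
y = A @ G(x_star)

def descent_direction(x):
    """d-layer analogue of (2)-(3): zero out the inactive rows of each layer at x."""
    J = np.eye(k)
    h = x
    for W in Ws:
        Wp = (W @ h > 0)[:, None] * W                # W_{i,+,x}: rows active at the current point
        h = Wp @ h                                   # equals relu(W h)
        J = Wp @ J                                   # accumulate W_{d,+,x} ... W_{1,+,x}
    J = A @ J
    return J.T @ (J @ x - y)
```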

Theorem 3.

Fix $\epsilon > 0$ and $d \geq 2$. If $n_i \geq C_\epsilon n_{i-1} \log n_{i-1}$ for all $i = 1, \ldots, d$ (with $n_0 = k$) and $m \geq C_\epsilon k \log\big(\prod_{i=1}^{d} n_i\big)$, then with probability at least $1 - c \sum_{i=1}^{d} n_i e^{-\gamma_\epsilon n_{i-1}} - c\, e^{-\gamma_\epsilon m}$, we have
$$D_{-v_{x,x_*}} f(x) < 0 \quad \text{for all nonzero } x \notin \mathcal{B}(x_*, \epsilon \|x_*\|_2) \cup \mathcal{B}(-\rho_d\, x_*, \epsilon \|x_*\|_2),$$

simultaneously for all nonzero $x_* \in \mathbb{R}^k$, where $\rho_d$ is a $d$-dependent positive number. Here, $C_\epsilon$ and $\gamma_\epsilon$ are constants that depend only on $\epsilon$.

In particular, for a fixed $d$, by taking the layer sizes large enough and sufficiently successively expansive, with a number of observations proportional to $k$ up to log factors, we can guarantee strict descent directions of the empirical risk objective everywhere but two arbitrarily small neighborhoods of two scalar multiples of $x_*$.

The case of inversion of a $d$-layer network follows by taking the matrix $A$ to be the identity, or alternatively by taking $A$ to be Gaussian with $m$ comparable to $n_d$.

Corollary 4.

(Inversion of a Multiple-Layer Network) Under the same assumptions as Theorem 3, but with $A = I_{n_d}$ and $m = n_d$, we have the same result if $n_i \geq C_\epsilon n_{i-1} \log n_{i-1}$ for all $i$, with probability at least $1 - c \sum_{i=1}^{d} n_i e^{-\gamma_\epsilon n_{i-1}}$.

The proofs of this theorem and corollary are in the supplemental materials and parallel those of Theorem 1 and Corollary 2.

This theorem extends Theorem 1 to the case of an arbitrary number of layers. It establishes that for a generative prior with an arbitrary number of fixed layers, if there are a sufficient number of measurements and if the size of each layer increases fast enough, then an algorithm that finds a stationary point of the empirical risk objective will either converge near $x_*$ or near a negative multiple of $x_*$ with high probability. As with the two-layer case, signal recovery is possible when the output of the generative model is significantly undersampled, achieving genuine compressive sensing.

4 Notation

Let $\mathrm{relu}(z) = \max(z, 0)$ apply entrywise for vectors $z$. For a matrix $W$ and vector $x$, let $\mathrm{diag}(Wx > 0)$ be the diagonal matrix that is 1 in the $(i,i)$th entry if $(Wx)_i > 0$ and 0 otherwise. Let $\mathcal{B}(z, r)$ be the Euclidean ball of radius $r$ centered at $z$. Let $W_{+,x} = \mathrm{diag}(Wx > 0)\,W$. Let $W_{1,+,x} = \mathrm{diag}(W_1 x > 0)\,W_1$ and $W_{2,+,x} = \mathrm{diag}(W_2 W_{1,+,x}\, x > 0)\,W_2$. Let $I_k$ be the $k \times k$ identity matrix. Let $X \preceq Y$ mean that $Y - X$ is a positive semidefinite matrix. For matrices $X$, let $\|X\|$ be the spectral norm of $X$. Let $S^{k-1}$ be the unit sphere in $\mathbb{R}^k$. For any nonzero $x$, let $\hat{x} = x / \|x\|_2$. For fixed nonzero $x$ and $y$, let $M_{\hat{x} \leftrightarrow \hat{y}}$ be the matrix such that $\hat{x} \mapsto \hat{y}$, $\hat{y} \mapsto \hat{x}$, and $z \mapsto 0$ for all $z \in \mathrm{span}(\{x, y\})^{\perp}$. For a set $S$, let $|S|$ denote its cardinality. Let $S_{+,x}$ denote the set of indices corresponding to the nonzero rows of $W_{+,x}$.
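As a small numerical illustration of the $W_{+,x} = \mathrm{diag}(Wx > 0)\,W$ notation (the dimensions below are arbitrary), zeroing the inactive rows linearizes the ReLU at the point $x$ itself:

```python
import numpy as np

rng = np.random.default_rng(4)
n, kdim = 6, 3                                       # arbitrary small sizes
W = rng.normal(size=(n, kdim))
x = rng.normal(size=kdim)

relu = lambda z: np.maximum(z, 0.0)

# W_{+,x} = diag(W x > 0) W keeps exactly the rows of W that are active at x,
W_plus_x = np.diag((W @ x > 0).astype(float)) @ W

# so at the point x itself the ReLU acts linearly: W_{+,x} x = relu(W x).
assert np.allclose(W_plus_x @ x, relu(W @ x))
```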

5 Organization of this paper

In Section 6, we prove Theorem 1 and state the technical lemmas used in its proof. We additionally state the technical lemmas used in the proof of Theorem 3. In the appendix, we prove the technical lemmas and Theorem 3. Most lemmas are stated twice, once in the main text and once in the appendix, and hence carry two different numbers.

6 Lemma Statements and Proof of Theorem 1

We prove Theorems 1 and 3 using arguments from probabilistic concentration theory. Specifically, we show that the descent vector $v_{x,x_*}$ concentrates around its expectation. Hence, $v_{x,x_*}$ is nonzero with high probability in regions where its expectation is not small in norm.
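Schematically (stated here only to make the logic of the argument explicit; the precise quantities and conditions are those given in the lemmas below), this reasoning rests on the elementary bound
$$\big\| v_{x,x_*} \big\|_2 \;\geq\; \big\| \mathbb{E}\,[\,v_{x,x_*}\,] \big\|_2 \;-\; \big\|\, v_{x,x_*} - \mathbb{E}\,[\,v_{x,x_*}\,] \,\big\|_2,$$
so that if the deviation term is uniformly small and the expectation is bounded away from zero outside the two neighborhoods, then $v_{x,x_*}$ is nonzero there.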

6.1 Central Technical Lemma

The central technical lemma concerns concentration of the matrices $W_{+,x}^t W_{+,y}$ and $W_{+,x}^t W_{+,x}$ uniformly over all nonzero $x$ and $y$. For any nonzero $x, y \in \mathbb{R}^k$, let $\theta_0 = \angle(x, y)$. Define

$$Q_{x,y} := \frac{\pi - \theta_0}{2\pi}\, I_k \;+\; \frac{\sin\theta_0}{2\pi}\, M_{\hat{x} \leftrightarrow \hat{y}}, \qquad (4)$$

where $\hat{x} = x/\|x\|_2$, $\hat{y} = y/\|y\|_2$, and $M_{\hat{x} \leftrightarrow \hat{y}}$ is the matrix such that $\hat{x} \mapsto \hat{y}$, $\hat{y} \mapsto \hat{x}$, and $z \mapsto 0$ for all $z \in \mathrm{span}(\{x, y\})^{\perp}$. An elementary calculation shows that if $W \in \mathbb{R}^{n \times k}$ has i.i.d. $\mathcal{N}(0, 1/n)$ entries, then $\mathbb{E}\big[W_{+,x}^t W_{+,y}\big] = Q_{x,y}$.
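A quick Monte Carlo check of this expectation identity is easy to run; the closed form used for $Q_{x,y}$ below is the one reconstructed in (4), and the dimensions and angle are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, kdim = 2000, 4                                    # illustrative sizes
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, kdim))

theta0 = 1.0                                         # angle between x and y, in radians
x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([np.cos(theta0), np.sin(theta0), 0.0, 0.0])

W_px = (W @ x > 0)[:, None] * W                      # W_{+,x} = diag(W x > 0) W
W_py = (W @ y > 0)[:, None] * W

# Build M_{x<->y}: maps x -> y, y -> x, and is zero on span{x, y}^perp.
e1 = x / np.linalg.norm(x)
e2 = y - (y @ e1) * e1
e2 /= np.linalg.norm(e2)
c, s = np.cos(theta0), np.sin(theta0)
M = c * (np.outer(e1, e1) - np.outer(e2, e2)) + s * (np.outer(e1, e2) + np.outer(e2, e1))

Q = (np.pi - theta0) / (2 * np.pi) * np.eye(kdim) + s / (2 * np.pi) * M

# Spectral deviation of the empirical matrix from Q; it should shrink as n grows.
print(np.linalg.norm(W_px.T @ W_py - Q, 2))
```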

Lemma 5 (Also Lemma 14).

Fix $\epsilon > 0$. Let $W \in \mathbb{R}^{n \times k}$ have i.i.d. $\mathcal{N}(0, 1/n)$ entries. If $n \geq C_\epsilon k \log k$, then with probability at least $1 - c\, n e^{-\gamma_\epsilon k}$,
$$\big\| W_{+,x}^t W_{+,y} - Q_{x,y} \big\| \leq \epsilon \quad \text{for all nonzero } x, y \in \mathbb{R}^k.$$

In the special case that $x = y$, we have
$$\Big\| W_{+,x}^t W_{+,x} - \tfrac{1}{2} I_k \Big\| \leq \epsilon \quad \text{for all nonzero } x \in \mathbb{R}^k.$$

Here, $C_\epsilon$ and $\gamma_\epsilon$ are constants that depend only on $\epsilon$.

6.2 Technical Lemmas for proof of Theorem 1

The central result about matrix concentration allows us to prove concentration of matrices of the form $W_{1,+,x}^t W_{2,+,x}^t W_{2,+,y} W_{1,+,y}$ and of vectors of the form $W_{1,+,x}^t W_{2,+,x}^t W_{2,+,y} W_{1,+,y}\, y$. These are established in the next two lemmas.

Lemma 6 (Also Lemma 24).

Fix $\epsilon > 0$. Let $W_1 \in \mathbb{R}^{n_1 \times k}$ have i.i.d. $\mathcal{N}(0, 1/n_1)$ entries. Let $W_2 \in \mathbb{R}^{n_2 \times n_1}$ have i.i.d. $\mathcal{N}(0, 1/n_2)$ entries, independent from $W_1$. If $n_1 \geq C_\epsilon k \log k$ and if $n_2 \geq C_\epsilon n_1 \log n_1$, then with probability at least $1 - c\, n_1 e^{-\gamma_\epsilon k} - c\, n_2 e^{-\gamma_\epsilon n_1}$,

Here, $C_\epsilon$ and $\gamma_\epsilon$ are constants that depend only on $\epsilon$.

Lemma 7 (Also Lemma 25).

Fix $\epsilon > 0$. Let $W_1 \in \mathbb{R}^{n_1 \times k}$ have i.i.d. $\mathcal{N}(0, 1/n_1)$ entries. Let $W_2 \in \mathbb{R}^{n_2 \times n_1}$ have i.i.d. $\mathcal{N}(0, 1/n_2)$ entries, independent from $W_1$. If $n_1 \geq C_\epsilon k \log k$ and if $n_2 \geq C_\epsilon n_1 \log n_1$, then with probability at least $1 - c\, n_1 e^{-\gamma_\epsilon k} - c\, n_2 e^{-\gamma_\epsilon n_1}$, for all nonzero $x$ and $y$,

Here, $C_\epsilon$ and $\gamma_\epsilon$ are constants that depend only on $\epsilon$.

The following lemma is a matrix RIP-like condition for $A$ acting on the range of $x \mapsto W_{2,+,x} W_{1,+,x}\, x$ for all $x$. Note that $W_{2,+,x} W_{1,+,x}\, x = G(x)$ for every $x$. We note that a closely related statement to Lemma 8 appears as Lemma 4.2 in [3]. We require a stronger condition, which we state as follows.

Lemma 8 (Also Lemma 18).

Fix $\epsilon > 0$. Let $W_1 \in \mathbb{R}^{n_1 \times k}$ have i.i.d. $\mathcal{N}(0, 1/n_1)$ entries. Let $W_2 \in \mathbb{R}^{n_2 \times n_1}$ have i.i.d. $\mathcal{N}(0, 1/n_2)$ entries. Let $A \in \mathbb{R}^{m \times n_2}$ have i.i.d. $\mathcal{N}(0, 1/m)$ entries, independent from $W_1$ and $W_2$. If $m \geq C_\epsilon k \log(n_1 n_2)$, then with probability at least $1 - c\, e^{-\gamma_\epsilon m}$,

(5)

Here, $C_\epsilon$ and $\gamma_\epsilon$ are constants that depend only on $\epsilon$.

We must also establish that $A$ acts like an isometry on certain vectors given by the points of non-differentiability of $f$. Let $\mathrm{diag}(W_1 x = 0)\, W_1$ be defined analogously to $W_{1,+,x}$; that is, it keeps only the rows of $W_1$ that are orthogonal to $x$.

Lemma 9 (Also Lemma 21).

Fix $\epsilon > 0$. Let $W_1 \in \mathbb{R}^{n_1 \times k}$ have i.i.d. $\mathcal{N}(0, 1/n_1)$ entries. Let $W_2 \in \mathbb{R}^{n_2 \times n_1}$ have i.i.d. $\mathcal{N}(0, 1/n_2)$ entries. Let $A \in \mathbb{R}^{m \times n_2}$ have i.i.d. $\mathcal{N}(0, 1/m)$ entries, independent from $W_1$ and $W_2$. If $m \geq C_\epsilon k \log(n_1 n_2)$, then with probability at least $1 - c\, e^{-\gamma_\epsilon m}$,

(6)

where

Here, $C_\epsilon$ and $\gamma_\epsilon$ are constants that depend only on $\epsilon$.

In order to bound the spectral norms of $W_{1,+,x}$ and $W_{2,+,x}$, we bound the spectral norm of matrices formed by selecting arbitrary subsets of a fraction of the rows of a Gaussian matrix.

Lemma 10 (Also Lemma 28).

Let

where is an integer greater than and let and be defined by

(7)

where is the angle between and , and we assume is bounded above by some universal constant. Define

where and is defined iteratively by (7). If , then we have that either

or

In particular, we have

6.3 Proof of Theorem 1

Using the Lemmas in the previous section, we may prove the theorem in the two-layer case.

Proof of Theorem 1.

Let

It suffices to prove . For brevity in notation we will let .

For a fixed and sufficiently small , we have , , and . Further, we have for sufficiently small that

Let be the event that . By Lemma 6, if and , then .

Let be the event that , where . By Lemma 7, if and , then .

Let be the event that By Lemma 8, if , then .

Let be the event that ,

By Lemma 9, if , then .

Let be the event that and . By Corollary 5.35 in [17], if and .

Let be the event that , and let be the event that . Because any independent Gaussians are linearly independent with probability 1, Lemma 11 provides that there exists such that if , then . Similarly, if , then .

We can now compute the directional derivative of $f$ in the direction of $-v_{x,x_*}$. On the intersection of the events above,

(8)

where the fifth line follows from the definition of , and ; and the last line follows from the triangle inequality, and for all , and .

Now we provide a lower bound for . For any , on , we have