# Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?

Raja Giryes, School of Electrical Engineering, Faculty of Engineering, Tel-Aviv University, Ramat Aviv 69978, Israel. {raja@tauex, bron@eng}.tau.ac.il

Guillermo Sapiro, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina, 27708, USA. {guillermo.sapiro}@duke.edu

Alex M. Bronstein, School of Electrical Engineering, Faculty of Engineering, Tel-Aviv University, Ramat Aviv 69978, Israel. {raja@tauex, bron@eng}.tau.ac.il

###### Abstract

Three important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.

## I Introduction

Deep neural networks (DNN) have led to a revolution in the areas of machine learning, audio analysis, and computer vision, achieving state-of-the-art results in numerous applications [1, 2, 3]. In this work we formally study the properties of deep network architectures with random weights applied to data residing in a low dimensional manifold. Our results provide insights into the outstanding empirically observed performance of DNN, the role of training, and the size of the training data.

Our motivation for studying networks with random weights is twofold. First, a series of works [4, 5, 6] empirically showed successful DNN learning techniques based on randomization. Second, studying a system with random weights rather than learned deterministic ones may lead to a better understanding of the system even in the deterministic case. For example, in the field of compressed sensing, where the goal is to recover a signal from a small number of its measurements, the study of random sampling operators led to breakthroughs in the understanding of the number of measurements required for achieving a stable reconstruction [7]. While the bounds provided in this case are universally optimal, the introduction of a learning phase provides a better reconstruction performance as it adapts the system to the particular data at hand [8, 9, 10]. In the field of information retrieval, random projections have been used in locality-sensitive hashing (LSH) schemes capable of alleviating the curse of dimensionality for approximate nearest neighbor search in very high dimensions [11]. While the original randomized scheme is seldom used in practice due to the availability of data-specific metric learning algorithms, it has provided many fruitful insights. Other fields, such as phase retrieval, gained significantly from a study based on random Gaussian weights [12].

Notice that the technique of proving results for deep learning with assumptions on some random distribution and then showing that the same holds in the more general case is not unique to our work. On the contrary, some of the stronger recent theoretical results on DNN follow this path. For example, Arora et al. analyzed the learning of autoencoders with random weights in the range $[-1, 1]$, showing that it is possible to learn them in polynomial time under some restrictions on the depth of the network [13]. Another example is the series of works [14, 15, 16] that study the optimization perspective of DNN.

In a similar fashion, in this work we study the properties of deep networks under the assumption of random weights. Before we turn to describe our contribution, we survey previous studies that formally analyzed the role of deep networks.

Hornik et al. [17] and Cybenko [18] proved that neural networks serve as universal approximators for any Borel measurable function. However, finding the network weights for a given function was shown to be NP-hard.

Bruna and Mallat proposed the wavelet scattering transform, a cascade of wavelet transform convolutions with nonlinear modulus and averaging operators [19]. They showed for this deep architecture that with more layers the resulting features can be made invariant to increasingly complex groups of transformations. The study of the wavelet scattering transform demonstrates that deeper architectures are able to better capture invariant properties of objects and scenes in images.

Anselmi et al. showed that image representations invariant to transformations such as translations and scaling can considerably reduce the sample complexity of learning and that a deep architecture with filtering and pooling can learn such invariant representations [20]. This result is particularly important in cases where training labels are scarce and in totally unsupervised learning regimes.

Montúfar and Morton showed that the depth of DNN allows representing restricted Boltzmann machines with a number of parameters exponentially greater than the number of the network parameters [21]. Montúfar et al. suggest that each layer divides the space by a hyper-plane [22]. Therefore, a deep network divides the space into an exponential number of sets, which is unachievable with a single layer with the same number of parameters.

Bruna et al. showed that the pooling stage in DNN results in shift invariance [23]. In [24], the same authors interpret this step as the removal of phase from a complex signal and show how the signal may be recovered after a pooling stage using phase retrieval methods. This work also calculates the Lipschitz constants of the pooling and the rectified linear unit (ReLU) stages, showing that they perform a stable embedding of the data under the assumption that the filters applied in the network are frames, e.g., for the ReLU stage there exist two constants $0 < A \le B$ such that for any $x, y \in \mathbb{R}^n$,

$$A\|x-y\|_2 \le \|\rho(Mx)-\rho(My)\|_2 \le B\|x-y\|_2, \tag{1}$$

where $M \in \mathbb{R}^{m \times n}$ denotes the linear operator at a given layer in the network with $n$ and $m$ denoting the input and output dimensions, respectively, and $\rho$ is the ReLU operator applied element-wise. However, the values of the Lipschitz constants $A$ and $B$ in real networks and their behavior as a function of the data dimension currently elude understanding. To see why such a bound may be very loose, consider the output of only the linear part of a fully connected layer with random i.i.d. Gaussian weights, , which is a standard initialization in deep learning. In this case, $A$ and $B$ scale like respectively [25]. This undesired behavior is not unique to a normally-distributed $M$, being characteristic of any distribution with a bounded fourth moment. Note that the addition of the non-linear operators, the ReLU and the pooling, makes these Lipschitz constants even worse.

### I-A Contributions

As the former example teaches, the scaling of the data introduced by $M$ may drastically deform the distances throughout each layer, even in the case where $m$ is very close to $n$, which makes it unclear whether it is possible to recover the input of the network (or of each layer) from its output. In this work, the main question we focus on is: What happens to the metric of the input data throughout the network? We focus on the above mentioned setting, assuming that the network has random i.i.d. Gaussian weights. We prove that DNN preserve the metric structure of the data as it propagates along the layers, allowing for the stable recovery of the original data from the features calculated by the network.

This type of property is often encountered in the literature [24, 26]. Notice, however, that the recovery of the input is possible if the size of the network output is proportional to the intrinsic dimension of the data at the input (which is not the case at the very last layer of the network, where we have class labels only), similarly to data reconstruction from a small number of random projections [27, 28, 29]. However, unlike random projections that preserve the Euclidean distances up to a small distortion [30], each layer of DNN with random weights distorts these distances proportionally to the angles between its input points: the smaller the angle at the input, the stronger the shrinkage of the distances. Therefore, the deeper the network, the stronger the shrinkage we get. Note that this does not contradict the fact that we can recover the input from the output; even when properties such as lighting, pose and location are removed from an image (up to a certain extent), the resemblance to the original image is still maintained.

As random projection is a universal sampling strategy for any low dimensional data [7, 29, 31], deep networks with random weights are a universal system that separates any data (belonging to a low dimensional model) according to the angles between its points, where the general assumption is that there are large angles between different classes [32, 33, 34, 35]. As training of the projection matrix adapts it to better preserve specific distances over others, the training of a network prioritizes intra-class angles over inter-class ones. This relation is alluded to by our proof techniques and is empirically manifested by observing the angles and the Euclidean distances at the output of trained networks, as demonstrated later in the paper in Section VI.

The rest of the paper is organized as follows: In Section II, we start by utilizing the recent theory of 1-bit compressed sensing to show that each DNN layer preserves the metric of its input data in the Gromov-Hausdorff sense up to a small constant $\delta$, under the assumption that these data reside in a low-dimensional manifold denoted by $K$. This allows us to draw conclusions on the tessellation of the space created by each layer of the network and the relation between the operation of these layers and locality-sensitive hashing (LSH) [11]. We also show that it is possible to retrieve the input of a layer, up to a certain accuracy, from its output. This implies that every layer preserves the important information of the data.

In Section III, we proceed by analyzing the behavior of the Euclidean distances and angles in the data throughout the network. This section reveals an important effect of the ReLU. Without the ReLU, we would just have random projections and Euclidean distance preservation. Our theory shows that the addition of ReLU makes the system sensitive to the angles between points. We prove that networks tend to decrease the Euclidean distances between points with a small angle between them (“same class”), more than the distances between points with large angles between them (“different classes”).

Then, in Section IV we prove that low-dimensional data at the input remain such throughout the entire network, i.e., DNN (almost) do not increase the intrinsic dimension of the data. This property is used in Section V to deduce the size of data needed for training DNN.

We conclude by studying the role of training in Section VI. As random networks are blind to the data labels, training may select discriminatively the angles that cause the distance deformation. Therefore, it will cause distances between different classes to increase more than the distances within the same class. We demonstrate this in several simulations, some of them with networks that recently showed state-of-the-art performance on challenging datasets, e.g., the network by [36] for the ImageNet dataset [37]. Section VII concludes the paper by summarizing the main results and outlining future research directions.

It is worthwhile emphasizing that the assumption that classes are separated by large angles is common in the literature (see [32, 33, 34, 35]). This assumption can further refer to some feature space rather than to the raw data space. Of course, some examples might be found that contradict this assumption such as the one of two concentric spheres, where each sphere represents a different class. With respect to such particular examples two things should be said: First, these cases are rare in real life signals, typically exhibiting some amount of scale invariance that is absent in this example. Second, we prove that the property of discrimination based on angles holds for DNN with random weights and conjecture in Section VI that a potential role (or consequence) of training in DNN is to favor certain angles over others and to select the origin of the coordinate system with respect to which the angles are calculated. We illustrate in Section VI the effect of training compared to random networks, using trained DNN that have achieved state-of-the-art results in the literature. Our claim is that DNN are suitable for models with clearly distinguishable angles between the classes if random weights are used, and for classes with some distinguishable angles between them if training is used.

For the sake of simplicity of the discussion and presentation clarity, we focus only on the role of the ReLU operator [38], assuming that our data are properly aligned, i.e., they are not invariant to operations such as rotation or translation, and therefore there is no need for the pooling operation to achieve invariance. Combining recent results for the phase retrieval problem with the proof techniques in this paper can lead to a theory also applicable to the pooling operation. In addition, with the strategy in [39, 40, 41, 42], it is possible to generalize our guarantees to sub-Gaussian distributions and to random convolutional filters. We defer these natural extensions to future studies.

## II Stable Embedding of a Single Layer

In this section we consider the distance between the metrics of the input and output of a single DNN layer of the form $f(M\cdot)$, mapping an input vector $x$ to the output vector $f(Mx)$, where $M$ is an $m \times n$ random Gaussian matrix and $f$ is a semi-truncated linear function applied element-wise. $f$ is such if it is linear on some (possibly, semi-infinite) interval and constant outside of it, , , and . The ReLU, henceforth denoted by $\rho$, is an example of such a function, while the sigmoid and the hyperbolic tangent functions satisfy this property approximately.

We assume that the input data belong to a manifold $K$ with the Gaussian mean width

$$\omega(K) := E\Big[\sup_{x,y\in K}\langle g, x-y\rangle\Big], \tag{2}$$

where the expectation is taken over a random vector $g$ with normal i.i.d. elements. To better understand this definition, note that $\sup_{x,y\in K}\langle g, x-y\rangle$ is the width of the set $K$ in the direction of $g$, as illustrated in Fig. 1. The mean provides an average over the widths of the set $K$ in different isotropically distributed directions, leading to the definition of the Gaussian mean width $\omega(K)$.
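The definition in (2) can be explored numerically. The following is a minimal sketch, assuming that $K$ is replaced by a finite sample of points and the expectation by a Monte Carlo average; the function names, sample sizes and the choice of test sets (unit-norm points in coordinate subspaces) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def gaussian_mean_width(points, num_draws=2000, seed=0):
    """Monte Carlo estimate of E[sup_{x,y in K} <g, x - y>] for a finite set K.

    `points` holds the elements of K as rows. For a fixed direction g, the
    supremum over pairs equals max_x <g, x> - min_y <g, y>.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((num_draws, points.shape[1]))
    projections = g @ points.T                      # shape: (num_draws, |K|)
    widths = projections.max(axis=1) - projections.min(axis=1)
    return float(widths.mean())


def unit_points_in_subspace(num_points, subspace_dim, ambient_dim, rng):
    """Unit-norm points supported on the first `subspace_dim` coordinates (illustrative set)."""
    x = np.zeros((num_points, ambient_dim))
    x[:, :subspace_dim] = rng.standard_normal((num_points, subspace_dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(1)
low_dim_set = unit_points_in_subspace(500, 3, 100, rng)    # intrinsic dimension 3
high_dim_set = unit_points_in_subspace(500, 50, 100, rng)  # intrinsic dimension 50

w_low = gaussian_mean_width(low_dim_set)
w_high = gaussian_mean_width(high_dim_set)
print(f"estimated width, 3-dim set:  {w_low:.2f}")
print(f"estimated width, 50-dim set: {w_high:.2f}")
```

Both sets live in the same ambient space, yet the estimate for the set of lower intrinsic dimension is smaller, which illustrates why the Gaussian mean width is used as a dimensionality measure.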

The Gaussian mean width is a useful measure for the dimensionality of a set. As an example we consider the following two popular data models:

#### Gaussian mixture models

(the -ball) consists of Gaussians of dimension . In this case .

#### Sparsely representable signals

The data in can be approximated by a sparse linear combination of atoms of a dictionary, i.e., , where is the pseudo-norm that counts the number of non-zeros in a vector and is the given dictionary. For this model .

Similar results can be shown for other models such as union of subspaces and low dimensional manifolds. For more examples and details on , we refer the reader to [29, 43].

We now show that each standard DNN layer performs a stable embedding of the data in the Gromov-Hausdorff sense [44], i.e., it is a $\delta$-isometry between $(K, d_K)$ and $(K', d_{K'})$, where $K$ and $K'$ are the manifolds of the input and output data and $d_K$ and $d_{K'}$ are metrics induced on them. A function $h: K \to K'$ is a $\delta$-isometry if

$$\left|d_{K'}(h(x),h(y)) - d_K(x,y)\right| \le \delta, \quad \forall x,y\in K, \tag{3}$$

and for every $x' \in K'$ there exists $x \in K$ such that $d_{K'}(x', h(x)) \le \delta$ (the latter property is sometimes called $\delta$-surjectivity). In the following theorem and throughout the paper $C$ denotes a given constant (not necessarily the same one) and $\mathbb{S}^{n-1}$ the unit sphere in $\mathbb{R}^n$.

###### Theorem 1

Let $M$ be the linear operator applied at a DNN layer, $f$ a semi-truncated linear activation function, and $K \subset \mathbb{S}^{n-1}$ the manifold of the input data for that layer. If is a random matrix with i.i.d. normally distributed entries, then the map $g$ defined by $g(x) = \mathrm{sgn}(f(Mx))$ is with high probability a $\delta$-isometry, i.e.,

$$\left|d_{\mathbb{S}^{n-1}}(x,y) - d_{H^m}(g(x),g(y))\right| \le \delta, \quad \forall x,y\in K,$$

with .

In the theorem $d_{\mathbb{S}^{n-1}}$ is the geodesic distance on $\mathbb{S}^{n-1}$, $d_{H^m}$ is the Hamming distance and the sign function, $\mathrm{sgn}(\cdot)$, is applied elementwise, and is defined as if and if . The proof of the approximate injectivity follows from the proof of Theorem 1.5 in [43].

Theorem 1 is important as it provides a better understanding of the tessellation of the space that each layer creates. This result stands in line with [22] that suggested that each layer in the network creates a tessellation of the input data by the different hyper-planes imposed by the rows in $M$. However, Theorem 1 implies more than that. It implies that each cell in the tessellation has a diameter of at most $\delta$ (see also Corollary 1.9 in [43]), i.e., if $x$ and $y$ fall to the same side of all the hyperplanes, then $d_{\mathbb{S}^{n-1}}(x,y) \le \delta$. In addition, the number of hyperplanes separating two points $x$ and $y$ in $K$ contains enough information to deduce their distance up to a small distortion. From this perspective, each layer followed by the sign function acts as locality-sensitive hashing [11], approximately embedding the input metric into the Hamming metric.
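The hashing interpretation can be illustrated with a small simulation. The sketch below assumes the ReLU as the activation, a sign convention in which non-positive entries map to $-1$, and distances normalized to $[0, 1]$ (the fraction of differing signs for the Hamming distance, and the angle divided by $\pi$ for the geodesic distance); the dimensions and helper names are illustrative choices rather than values from the paper.

```python
import numpy as np


def sign_codes(M, X):
    """Binary code sgn(relu(M x)) for every row x of X (+1 where positive, -1 otherwise)."""
    relu_out = np.maximum(X @ M.T, 0.0)
    return np.where(relu_out > 0, 1, -1)


def normalized_hamming(a, b):
    """Fraction of coordinates in which two codes differ."""
    return float(np.mean(a != b))


def normalized_geodesic(x, y):
    """Angle between x and y divided by pi."""
    cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)


rng = np.random.default_rng(0)
n, m = 50, 20000                      # input and output dimensions (illustrative)
M = rng.standard_normal((m, n))       # i.i.d. Gaussian weights; the scale does not affect signs

x = rng.standard_normal(n)
y = rng.standard_normal(n)
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

codes = sign_codes(M, np.stack([x, y]))
d_hamming = normalized_hamming(codes[0], codes[1])
d_geodesic = normalized_geodesic(x, y)
print(f"normalized Hamming distance:  {d_hamming:.4f}")
print(f"normalized geodesic distance: {d_geodesic:.4f}")
```

Each row of the matrix defines a hyperplane through the origin, and the two points receive different signs exactly when that hyperplane separates them, so the fraction of separating hyperplanes tracks the angle between the points.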

Having a stable embedding of the data, it is natural to assume that it is possible to recover the input of a layer from its output. Indeed, Mahendran and Vedaldi demonstrate that it is achievable through the whole network [26]. The next result provides a theoretical justification for this, showing that it is possible to recover the input of each layer from its output:

###### Theorem 2

Under the assumptions of Theorem 1 there exists a program such that , where .

The proof follows from Theorem 1.3 in [29]. If $K$ is a cone then one may also use Theorem 2.1 in [45] to get a similar result.

Theorems 1 and 2 both apply existing results from 1-bit compressed sensing to DNN. Theorem 1 deals with embedding into the Hamming cube and Theorem 2 uses this fact to show that we can recover the input from the output. Note that Theorem 1 only applies to an individual layer and cannot be applied consecutively throughout the network, since it deals with embedding into the Hamming cube. One way to deal with this problem is to extend the proof in [43] to an embedding from to . Instead, we turn to focus on the ReLU and prove more specific results about the exact distortion of angles and Euclidean distances. These also include a proof about a stable embedding of the network at each layer from to .

## III Distance and Angle Distortion

So far we have focused on the metric preservation of the deep networks in terms of the Gromov-Hausdorff distance. In this section we turn to look at how the Euclidean distances and angles change within the layers. We focus on the case of ReLU as the activation function. A similar analysis can also be applied for pooling. For the simplicity of the discussion we defer it to future study.

Note that so far we assumed that $K$ is normalized and lies on the sphere $\mathbb{S}^{n-1}$. Given that the data at the input of the network lie on the sphere and we use the ReLU as the activation function, the transformation keeps the output data (approximately) on a sphere (with half the diameter, see (31) in the proof of Theorem 3 in the sequel). Therefore, in this case the normalization requirement holds up to a small distortion throughout the layers. This adds a motivation for having a normalization stage at the output of each layer, which was shown to provide some gain in several DNN [1, 46].

Normalization, which is also useful in shallow representations [47], can be interpreted as a transformation making the inner products between the data vectors coincide with the cosines of the corresponding angles. While the bounds we provide in this section do not require normalization, they show that the operations of each layer rely on the angles between the data points.

The following two results relate the Euclidean and angular distances in the input of a given layer to the ones in its output. We denote by the Euclidean ball of radius .

###### Theorem 3

Let be the linear operator applied at a given layer, (the ReLU) be the activation function, and be the manifold of the data in the input layer. If is a random matrix with i.i.d. normally distributed entries and , then with high probability

 ∣∣∣∥ρ(Mx)−ρ(My)∥22− (4)

where and

$$\psi(x,y) = \frac{1}{\pi}\Big(\sin\angle(x,y) - \angle(x,y)\cos\angle(x,y)\Big). \tag{5}$$

###### Theorem 4

Under the same conditions of Theorem 3 and , where , with high probability

$$\left|\cos\angle\big(\rho(Mx),\rho(My)\big) - \big(\cos\angle(x,y)+\psi(x,y)\big)\right| \le \frac{15\delta}{\beta^2-2\delta}. \tag{6}$$

Remark 1 As we have seen in Section II, the assumption implies if is a GMM and if is generated by -sparse representations in a dictionary . As we shall see later in Theorem 6 in Section IV, it is enough to have the model assumption only at the data in the input of the DNN. Finally, the quadratic relationship between and might be improved.

We leave the proof of these theorems to Appendices A and B, and dwell on their implications.
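As a sanity check of the angular relation in Theorem 4, the following sketch compares the cosine of the angle at the output of one random ReLU layer with the quantity $\cos\angle(x,y)+\psi(x,y)$. It is only an illustration under assumed settings: the dimensions, the input angle and the helper names are hypothetical, and the weights are drawn as standard normal since a global scaling of $M$ does not change angles.

```python
import numpy as np


def psi(theta):
    """psi from (5), written as a function of the angle theta between x and y."""
    return (np.sin(theta) - theta * np.cos(theta)) / np.pi


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def relu(z):
    return np.maximum(z, 0.0)


rng = np.random.default_rng(0)
n, m = 30, 50000                       # illustrative input and output dimensions
M = rng.standard_normal((m, n))        # i.i.d. Gaussian weights

theta_in = 2.0                         # input angle in radians (illustrative)
x = np.zeros(n)
x[0] = 1.0
y = np.zeros(n)
y[0], y[1] = np.cos(theta_in), np.sin(theta_in)

cos_out = cosine(relu(M @ x), relu(M @ y))
cos_pred = np.cos(theta_in) + psi(theta_in)
print(f"empirical cosine at the output: {cos_out:.4f}")
print(f"cos(theta) + psi(theta):        {cos_pred:.4f}")
```

With a wide layer the two values agree closely, and the gap is expected to widen as the output dimension shrinks.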

Note that if $\psi(x,y)$ were equal to zero, then theorems 3 and 4 would have stated that the distances and angles are preserved throughout the network (in the former case, up to a factor of 0.5; more specifically, we would have . Notice that this is the expectation of the distance .). As can be seen in Fig. 2, $\psi(x,y)$ behaves approximately like . The larger the angle $\angle(x,y)$, the larger $\psi(x,y)$ is, and, consequently, also the Euclidean distance at the output. If the angle between $x$ and $y$ is close to zero (i.e., $\cos\angle(x,y)$ is close to 1), $\psi(x,y)$ vanishes and therefore the Euclidean distance shrinks by half throughout the layers of the network. We emphasize that this is not in contradiction to Theorem 1, which guarantees an approximately isometric embedding into the Hamming space. While the binarized output with the Hamming metric approximately preserves the input metric, the Euclidean metric on the raw output is distorted.

Considering this effect on the Euclidean distance, the smaller the angle between $x$ and $y$, the larger the distortion to this distance at the output of the layer, and the smaller the distance turns out to be. On the other hand, the shrinkage of the distances is bounded, as can be seen from the following corollary of Theorem 3.

###### Corollary 5

Under the same conditions of Theorem 3, with high probability

$$\frac{1}{2}\|x-y\|_2^2 - \delta \le \|\rho(Mx)-\rho(My)\|_2^2 \le \|x-y\|_2^2 + \delta. \tag{7}$$

The proof follows from the inequality of arithmetic and geometric means and the behavior of $\psi(x,y)$ (see Fig. 2). We conclude that DNN with random Gaussian weights preserve local structures in the manifold and enable decreasing distances between points away from it, a property much desired for classification.

The influence of the entire network on the angles is slightly different. Note that starting from the input of the second layer, all the vectors reside in the non-negative orthant. The cosine of the angles is translated from the range in the first layer to the range in the subsequent second layers. In particular, the range is translated to the range , and in terms of the angle from the range to . These angles shrink approximately by half, while the ones that are initially small remain approximately unchanged.

The action of the network preserves the order between the angles. Generally speaking, the network affects the angles in the range in the same way. In particular, in the range the angles in the output of the layer behave like and in the wider range they are bounded from below by (see Fig. 3). Therefore, we conclude that the DNN distort the angles in the input manifold similarly and keep the general configuration of angles between the points.
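The effect of several layers on the angles can be sketched by iterating the single-layer relation of Theorem 4. The code below treats $\cos\angle(x,y)+\psi(x,y)$ as the exact output cosine, i.e., it ignores the distortion term on the right-hand side of (6); this idealization, the chosen depth and the function names are assumptions made for illustration only.

```python
import numpy as np


def psi(theta):
    """psi from (5), written as a function of the angle theta."""
    return (np.sin(theta) - theta * np.cos(theta)) / np.pi


def layer_angle_map(theta):
    """Idealized output angle of one random ReLU layer for input angle theta."""
    return np.arccos(np.clip(np.cos(theta) + psi(theta), -1.0, 1.0))


def propagate(theta0, depth):
    """Angles after 0, 1, ..., depth idealized layers, starting from theta0."""
    angles = [float(theta0)]
    for _ in range(depth):
        angles.append(float(layer_angle_map(angles[-1])))
    return np.array(angles)


depth = 8
small = propagate(0.5, depth)
large = propagate(2.5, depth)
for layer, (a, b) in enumerate(zip(small, large)):
    print(f"layer {layer}: {np.degrees(a):7.2f} deg   {np.degrees(b):7.2f} deg")
```

In this idealized recursion an angle of $\pi$ is mapped to $\pi/2$ by the first layer, every angle shrinks from layer to layer, and the ordering between two initial angles is kept at all depths.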

To see that our theory captures the behavior of DNN endowed with pooling, we test how the angles change through the state-of-the-art -layers deep network trained in [36] for the ImageNet dataset. We randomly select angles (pairs of data points) in the input of the network, partitioned into three equally-sized groups, each group corresponding to one of the ranges , and . We test their behavior after applying eight and sixteen non-linear layers. The latter case corresponds to the part of the network excluding the fully connected layers. We denote by the vector in the output of a layer corresponding to the input vector . Fig. 4 presents a histogram of the values of the angles at the output of each of the layers for each of the three groups. It also shows the ratio and difference , between the angles at the output of the layers and their original value at the input of the network.

As Theorem 4 predicts, the ratio corresponding to is half the ratio corresponding to input angles in the range . Furthermore, the ratios in the ranges and are approximately the same, where in the range they are slightly larger. This is in line with Theorem 4, which claims that the angles in this range decrease at approximately the same rate, where for larger angles the shrinkage is slightly larger. Also note that according to our theory the ratio corresponding to input angles in the range should behave on average like , where is the number of layers. Indeed, for , and for , ; the centers of the histograms for the range are very close to these values. Notice that we have a similar behavior also for the range . This is not surprising, as by looking at Fig. 3 one may observe that these angles also turn out to be in the range that has the ratio . Remarkably, Fig. 4 demonstrates that the network keeps the order between the angles as Theorem 4 suggests. Notice that the shrinkage of the angles does not cause large angles to become smaller than other angles that were originally significantly smaller than them. Moreover, small angles in the input remain small in the output as can be seen in Fig. 4(right).

We sketch the distortion of two sets with distinguishable angle between them by one layer of DNN with random weights in Fig. 5. It can be observed that the distance between points with a smaller angle between them shrinks more than the distance between points with a larger angle between them. Ideally, we would like this behavior, causing points belonging to the same class to stay closer to each other in the output of the network, compared to points from different classes. However, random networks are not selective in this sense: if a point forms the same angle with a point from its class and with a point from another class, then their distance will be distorted approximately by an equal amount. Moreover, the separation caused by the network is dependent on the setting of the coordinate system origin with respect to which the angles are calculated. The location of the origin is dependent on the bias terms (in this case each layer is of the form ), which are set to zero in the random networks here studied. These are learned by the training of the network, affecting the angles that cause the distortions of the Euclidean (and angular) distances. We demonstrate the effect of training in Section VI.

## IV Embedding of the Entire Network

In order to show that the results in sections II and III also apply to the entire network and not only to one layer, we need to show that the Gaussian mean width does not grow significantly as the data propagate through the layers. Instead of bounding the variation of the Gaussian mean width throughout the network, we bound the change in the covering number $N_\epsilon(K)$, i.e., the smallest number of $\ell_2$-balls of radius $\epsilon$ that cover $K$. Having the bound on the covering number, we use Dudley’s inequality [48],

$$\omega(K) \le C\int_0^\infty \sqrt{\log N_\epsilon(K)}\, d\epsilon, \tag{8}$$

to bound the Gaussian mean width.

###### Theorem 6

Under the assumptions of Theorem 1,

$$N_\epsilon\big(f(MK)\big) \le N_{\epsilon/\left(1+\frac{\omega(K)}{\sqrt{m}}\right)}(K). \tag{9}$$

Proof: We divide the proof into two parts. In the first one, we consider the effect of the activation function on the size of the covering, while in the second we examine the effect of the linear transformation . Starting with the activation function, let be a center of a ball in the covering of and be a point that belongs to the ball of of radius , i.e., . It is not hard to see that since a semi-truncated linear activation function shrinks the data, then and therefore the size of the covering does not increase (but might decrease).

For the linear part we have that [49, Theorem 1.4]

 (10)

Therefore, after the linear operation each covering ball with initial radius is not bigger than . Since the activation function does not increase the size of the covering, we have that after a linear operation followed by an activation function, the size of the covering balls increases by a factor of . Therefore, the size of a covering with balls of radius of the output data is bounded by the size of a covering with balls of radius .
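The first step of the proof, namely that the activation does not enlarge the covering balls, can be checked numerically. The sketch below tests on random pairs of points that the ReLU, and a clipped-linear function used here as a second example of a semi-truncated linear activation, never increase Euclidean distances; the sample sizes and the choice of the clipping interval $[-1, 1]$ are illustrative assumptions.

```python
import numpy as np


def relu(z):
    """ReLU: linear on [0, inf) and constant (zero) below it."""
    return np.maximum(z, 0.0)


def clipped_linear(z):
    """Linear on [-1, 1] and constant outside of it (illustrative semi-truncated function)."""
    return np.clip(z, -1.0, 1.0)


def max_distance_ratio(activation, a, b):
    """Largest ratio ||f(a_i) - f(b_i)|| / ||a_i - b_i|| over the sampled pairs (rows)."""
    before = np.linalg.norm(a - b, axis=1)
    after = np.linalg.norm(activation(a) - activation(b), axis=1)
    return float(np.max(after / before))


rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 64))   # 1000 random points in dimension 64
b = rng.standard_normal((1000, 64))   # their 1000 partners

ratio_relu = max_distance_ratio(relu, a, b)
ratio_clip = max_distance_ratio(clipped_linear, a, b)
print(f"largest distance ratio under ReLU:            {ratio_relu:.4f}")
print(f"largest distance ratio under clipped linear:  {ratio_clip:.4f}")
```

Since both functions are applied element-wise and change each coordinate difference by a factor of at most one, a ball of radius $\epsilon$ around a center is mapped into a ball of radius at most $\epsilon$ around the image of that center.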

Theorem 6 generalizes the results in theorems 2, 3 and 4 such that they can be used for the whole network: there exists an algorithm that recovers the input of the DNN from its output; the DNN as a whole distorts the Euclidean distances based on the angles of the input of the network; and the angular distances smaller than are not altered significantly by the network.

Note, however, that Theorem 6 does not apply to Theorem 1. In order to do that for the latter, we also need a version of Theorem 1 that guarantees a stable embedding using the same metric at the input and the output of a given layer, e.g., embedding from the Hamming cube to the Hamming cube or from to . Indeed, we have exactly such a guarantee in Corollary 5 that implies stable embedding of the Euclidean distances in each layer of the network. Though this corollary focuses on the particular case of the ReLU, unlike Theorem 1 that covers more general activation functions, it implies stability for the whole network in the Lipschitz sense, which is even stronger than stability in the Gromov-Hausdorff sense that we would get by having the generalization of Theorem 1.

As an implication of Theorem 6, consider low-dimensional data admitting a Gaussian mixture model (GMM) with Gaussians of dimension or a -sparse representation in a given dictionary with atoms. For GMM, the covering number is for and otherwise (see [31]). Therefore we have that and that at each layer the Gaussian mean width grows at most by a factor of . In the case of sparsely representable data, . By Stirling’s approximation we have and therefore . Thus, at each layer the Gaussian mean width grows at most by a factor of .

## V Training Set Size

An important question in deep learning is how many labeled samples are needed for training. Using Sudakov minoration [48], one can get an upper bound on the size of a covering of $K$, $N_\epsilon(K)$, which is the number of balls of radius $\epsilon$ that include all the points in $K$. We have demonstrated that networks with random Gaussian weights realize a stable embedding; consequently, if a network is trained using the screening technique by selecting the best among many networks generated with random weights as suggested in [4, 5, 6], then the number of data points needed in order to guarantee that the network represents all the data is . Since is a proxy for the intrinsic data dimension as we have seen in the previous sections (see [29] for more details), this bound formally predicts that the number of training points grows exponentially with the intrinsic dimension of the data.

The exponential dependency is too pessimistic, as it is often possible to achieve a better bound on the required training sample size. Indeed, the bound developed in [13] requires far fewer samples. As the authors study the data recovery capability of an autoencoder, they assume that there exists a ‘ground-truth autoencoder’ generating the data. A combination of the data dimension bound here provided, with a prior on the relation of the data to a deep network, should lead to a better prediction of the number of needed training samples. In fact, we cannot refrain from drawing an analogy with the field of sparse representations of signals, where the combined use of the properties of the system with those of the input data led to works that improved the bounds beyond the naïve manifold covering number (see [50] and references therein).

The following section presents such a combination, by showing empirically that the purpose of training in DNN is to treat boundary points. This observation is likely to lead to a significant reduction in the required size of the training data, and may also apply to active learning, where the training set is constructed adaptively.

## VI The Role of Training

The proof of Theorem 3 provides us with an insight on the role of training. One key property of the Gaussian distribution, which allows it to keep the ratio between the angles in the data, is its rotational invariance. The phase of a random Gaussian vector with i.i.d. entries is a random vector with a uniform distribution. Therefore, it does not prioritize one direction in the manifold over the other but treats all the same, leading to the behavior of the angles and distances throughout the net that we have described above.
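The rotational invariance described above can be checked empirically. The sketch below is an illustration under assumed values (dimension 3, 200,000 samples, one random orthogonal transform): it normalizes i.i.d. Gaussian vectors and compares the first and second moments of their directions before and after the transform. For a uniform distribution on the sphere, the mean direction is zero and the second-moment matrix is the identity divided by the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_samples = 3, 200_000  # illustrative values (assumptions)

# i.i.d. Gaussian vectors and their directions (unit-norm "phases").
g = rng.standard_normal((num_samples, n))
directions = g / np.linalg.norm(g, axis=1, keepdims=True)

# An arbitrary orthogonal transform, from the QR factorization of a random matrix.
q, _ = np.linalg.qr(rng.standard_normal((n, n)))
rotated = directions @ q.T

# Moments before and after the transform; both should match the uniform law.
mean_before = directions.mean(axis=0)
mean_after = rotated.mean(axis=0)
second_moment_before = directions.T @ directions / num_samples
second_moment_after = rotated.T @ rotated / num_samples

print(np.round(second_moment_before, 3))
print(np.round(second_moment_after, 3))
```

Matching moments do not prove uniformity, but they show that no direction is favored, which is the property the argument relies on.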

In general, points within the same class would have small angles within them and points from distinct classes would have larger ones. If this holds for all the points, then random Gaussian weights would be an ideal choice for the network parameters. However, as in practice this is rarely the case, an important role of the training would be to select in a smart way the separating hyper-planes induced by in such a way that the angles between points from different classes are ‘penalized more’ than the angles between the points in the same class.

Theorem 3 and its proof provide some understanding of how this can be done by the learning process. Consider the expectation of the Euclidean distance between two points and at the output of a given layer. It reads as (the derivation appears in Appendix A)

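$$\mathbb{E}\left[\|\rho(Mx)-\rho(My)\|_2^2\right] = \frac{1}{2}\|x\|_2^2 + \frac{1}{2}\|y\|_2^2 + \frac{\|x\|_2\|y\|_2}{\pi}\int_0^{\pi-\angle(x,y)}\sin(\theta)\sin(\theta+\angle(x,y))\,d\theta. \tag{11}$$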

Note that the integration in this formula is done uniformly over the interval , which contains the range of directions that have simultaneously positive inner products with and . With learning, we have the ability to pick the angle that maximizes/minimizes the inner product based on whether and belong to the same class or to distinct classes and in this way increase/decrease their Euclidean distances at the output of the layer.

Optimizing over all the angles between all the pairs of the points is a hard problem. This explains why random initialization is a good choice for DNN. As it is hard to find the optimal configuration that separates the classes on the manifold, it is desirable to start with a universal one that treats most of the angles and distances well, and then to correct it in the locations that result in classification errors.

To validate this hypothesized behavior, we trained two DNN on the MNIST and CIFAR-10 datasets, each containing classes. The training of the networks was done using the matconvnet toolbox [51]. The MNIST and CIFAR-10 networks were trained with four and five layers, respectively, followed by a softmax operation. We used the default settings provided by the toolbox for each dataset, where with CIFAR-10 we also used horizontal mirroring and filters in the first two layers instead of (which is the default in the example provided with the package) to improve performance. The trained DNN achieve and errors for MNIST and CIFAR-10 respectively.

For each data point we calculate its Euclidean and angular distances to its farthest intra-class point and to its closest inter-class point. We compute the ratio between the distances at the output of the last convolutional layer (the input of the fully connected layers) and the ones at the input. Let be the point at the input, be its farthest point from the same class, be the point at the output, and be its farthest point from the same class (it should not necessarily be the output of ), then we calculate for Euclidean distances and for the angles. We do the same for the distances between different classes, comparing the shortest ones. Fig. 6 presents histograms of these distance ratios for CIFAR-10. In Fig. 7 we present the histograms of the differences of the Euclidean and angular distances, i.e., and . We also compare the behavior of all the inter and intra-class distances by computing the above ratios for all pairs of points in the input with respect to their corresponding points at the output. These ratios are presented in Fig. 8. We present also the differences and in Fig. 9. We present the results for three trained networks, in addition to the random one, denoted by Net1, Net2 and Net3. Each of them corresponds to a different amount of training epochs, resulting with a different classification error.
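The boundary-point measurement described above can be sketched as follows. This is a hedged illustration, not the code used for the reported experiments: the function name `boundary_distance_ratios`, the toy two-class data, and the single random ReLU layer standing in for a network are all assumptions. For every point it finds the farthest same-class distance and the closest other-class distance, separately at the input and at the output, and returns the output-to-input ratios.

```python
import numpy as np

def boundary_distance_ratios(x_in, x_out, labels):
    """Output/input ratios of farthest intra-class and closest inter-class distances."""
    def pairwise(z):
        sq = np.sum(z**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
        return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off

    d_in, d_out = pairwise(x_in), pairwise(x_out)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # a point is not its own neighbor
    diff = labels[:, None] != labels[None, :]

    # The extremal neighbor is searched independently at the input and the output.
    far_in = np.where(same, d_in, -np.inf).max(axis=1)
    far_out = np.where(same, d_out, -np.inf).max(axis=1)
    near_in = np.where(diff, d_in, np.inf).min(axis=1)
    near_out = np.where(diff, d_out, np.inf).min(axis=1)
    return far_out / far_in, near_out / near_in

# Toy data (assumption): two Gaussian blobs and one random Gaussian ReLU layer.
rng = np.random.default_rng(0)
n, m, per_class = 20, 400, 50
centers = rng.standard_normal((2, n))
labels = np.repeat([0, 1], per_class)
x_in = centers[labels] + 0.3 * rng.standard_normal((2 * per_class, n))
weights = rng.standard_normal((m, n)) / np.sqrt(m)
x_out = np.maximum(x_in @ weights.T, 0.0)

intra_ratio, inter_ratio = boundary_distance_ratios(x_in, x_out, labels)
print(intra_ratio.mean(), inter_ratio.mean())
```

Histograms of `intra_ratio` and `inter_ratio` for a random and a trained network would correspond to the kind of comparison discussed in this section; the angular version follows the same pattern with angles in place of Euclidean distances.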

Considering the random DNN, note that all the histograms of the ratios are centered around and the ones of the differences around , implying that the network preserves most of the distances as our theorems predict for a network with random weights. For the trained networks, the histograms over all data point pairs (Figs. 8 and 9) change only slightly due to training. Also observe that the trained networks behave like their random counterparts in keeping the distance of a randomly picked pair of points. However, they distort the distances between points on class boundaries “better” than the random network (Figs. 6 and 7), in the sense that the farthest intra-class distances are shrunk by a larger factor than the ones of the random network, and the closest inter-class distances are set farther apart by the training. Notice that the shrinking of the distances within the class and the enlargement of the distances between the classes improve as the training proceeds. This confirms our hypothesis that a goal of training is to treat the boundary points.

A similar behavior can be observed for the angles. The closest angles are enlarged more in the trained network compared to the random one. However, enlarging the angles between classes also causes the enlargement of the angles within the classes. Notice though that these are enlarged less than the ones which are outside the class. Finally, observe that the enlargement of the angles, as we have seen in our theorems, causes a larger distortion in the Euclidean distances. Therefore, we may explain the enlargement of the angles within the class as a means for shrinking the intra-class distances.

Similar behavior is observed for the MNIST dataset. However, the gaps between the random network and the trained network are smaller as the MNIST dataset contains data which are initially well separated. As we have argued above, for such manifolds the random network is already a good choice.

We also compared the behavior of the validation data of the ImageNet dataset in the network provided by [36] and in the same network but with random weights. The results are presented in Figs. 10, 11, 12 and 13. Behavior similar to the one we observed in the case of CIFAR-10 is also manifested by the ImageNet network.

## VII Discussion and Conclusion

We have shown that DNN with random Gaussian weights perform a stable embedding of the data, drawing a connection between the dimension of the features produced by the network that still keep the metric information of the original manifold, and the complexity of the data. The metric preservation property of the network provides a formal relationship between the complexity of the input data and the size of the required training set. Interestingly, follow-up studies [52, 53] found that adding metric preservation constraints to the training of networks also leads to a theoretical relation between the complexity of the data and the number of training samples. Moreover, this constraint is shown to improve in practice the generalization error, i.e., improves the classification results when only a small number of training examples is available.

While preserving the structure of the initial metric is important, it is vital to have the ability to distort some of the distances in order to deform the data in a way that the Euclidean distances represent more faithfully the similarity we would like to have between points from the same class. We proved that such an ability is inherent to the DNN architecture: the Euclidean distances of the input data are distorted throughout the networks based on the angles between the data points. Our results lead to the conclusion that DNN are universal classifiers for data based on the angles of the principal axis between the classes in the data. As these are not the angles we would like to work with in reality, the training of the DNN reveals the actual angles in the data. In fact, for some applications it is possible to use networks with random weights at the first layers for separating the points with distinguishable angles, followed by trained weights at the deeper layers for separating the remaining points. This is practiced in the extreme learning machines (ELM) techniques [54] and our results provide a possible theoretical explanation for the success of this hybrid strategy.

Our work implies that it is possible to view DNN as a stagewise metric learning process, suggesting that it might be possible to replace the currently used layers with other metric learning algorithms, opening a new avenue for semi-supervised DNN. This also stands in line with the recent literature on convolutional kernel methods (see [55, 56]).

In addition, we observed that a potential main goal of the training of the network is to treat the class boundary points, while keeping the other distances approximately the same. This may lead to a new active learning strategy for deep learning [57].

Acknowledgments- Work partially supported by NSF, ONR, NGA, NSSEFF, and ARO. A.B. is supported by ERC StG 335491 (RAPID). The authors thank the reviewers of the manuscript for their suggestions which greatly improved the paper.

## Appendix A Proof of Theorem 3

Before we turn to prove Theorem 3, we present two propositions that will aid us in its proof. The first is the Gaussian concentration bound that appears in [48, Equation 1.6].

###### Proposition 7

Let be an i.i.d. Gaussian random vector with zero mean and unit variance, and be a Lipschitz-continuous function with a Lipschitz constant . Then for every , with probability exceeding ,

$$\left|\eta(g)-\mathbb{E}[\eta(g)]\right| < \alpha. \tag{12}$$
###### Proposition 8

Let be a random vector with zero-mean i.i.d. Gaussian distributed entries with variance , and be a set with a Gaussian mean width . Then, with probability exceeding ,

$$\sup_{x,y\in K}\left(\rho(m^Tx)-\rho(m^Ty)\right)^2 < \frac{4\omega(K)^2}{m}. \tag{13}$$

Proof: First, notice that from the properties of the ReLU , it holds that

$$\left|\rho(m^Tx)-\rho(m^Ty)\right| \le \left|m^T(x-y)\right|. \tag{14}$$

Let be a scaled version of such that each entry in has a unit variance (as has a variance ). From the Gaussian mean width characteristics (see Proposition 2.1 in [29]), we have . Therefore, combining the Gaussian concentration bound in Proposition 7 together with the fact that is Lipschitz-continuous with a constant (since ), we have

$$\left|\sup_{x,y\in K} g^T(x-y)-\omega(K)\right| < \alpha, \tag{15}$$

with probability exceeding . Clearly, (15) implies

$$\sup_{x,y\in K} g^T(x-y) \le 2\omega(K), \tag{16}$$

where we set . Combining (16) and (14) with the fact that

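$$\sup_{x,y\in K}\left(\rho(g^Tx)-\rho(g^Ty)\right)^2 = \left(\sup_{x,y\in K}\left(\rho(g^Tx)-\rho(g^Ty)\right)\right)^2$$

leads to

$$\sup_{x,y\in K}\left(\rho(g^Tx)-\rho(g^Ty)\right)^2 < 4\omega(K)^2. \tag{17}$$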

Dividing both sides by completes the proof.
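The bound in (13) can be probed numerically on a finite set. The following is a rough sketch under stated assumptions: the set `K` is a finite sample of unit-norm points, the Gaussian mean width is taken as the expected supremum of g^T(x−y) over pairs (as in (15)) and estimated by Monte Carlo, and the random vector has entries of variance 1/m, which is inferred from the scaling in (13). The sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, num_points, trials = 20, 100, 50, 2000  # illustrative values (assumptions)

# Finite set K: random points on the unit sphere in R^n.
K = rng.standard_normal((num_points, n))
K /= np.linalg.norm(K, axis=1, keepdims=True)

# Monte Carlo estimate of the mean width: E sup_{x,y in K} g^T (x - y).
g = rng.standard_normal((trials, n))
proj = g @ K.T
omega = (proj.max(axis=1) - proj.min(axis=1)).mean()

# Fresh Gaussian vectors with entry variance 1/m, followed by the ReLU.
m_vec = rng.standard_normal((trials, n)) / np.sqrt(m)
relu = np.maximum(m_vec @ K.T, 0.0)

# sup over pairs of (rho(m^T x) - rho(m^T y))^2 equals (max - min)^2.
lhs = (relu.max(axis=1) - relu.min(axis=1)) ** 2
rhs = 4.0 * omega**2 / m
fraction_ok = np.mean(lhs < rhs)

print(f"estimated mean width: {omega:.3f}, bound satisfied in {fraction_ok:.1%} of draws")
```

Because the ReLU is 1-Lipschitz, the left-hand side can never exceed the squared range of the linear projections, which is why the bound holds with a comfortable margin on such a set.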

Proof of Theorem 3: Our proof of Theorem 3 consists of three key steps. In the first one, we show that the bound in (4) holds with high probability for any two points . In the second, we pick an -cover for and show that the same holds for each pair in the cover. The last generalizes the bound for any point in .

Bound for a pair : Denoting by the -th column of , we rewrite

$$\|\rho(Mx)-\rho(My)\|_2^2 = \sum_{i=1}^{m}\left(\rho(m_i^Tx)-\rho(m_i^Ty)\right)^2. \tag{18}$$

Notice that since all the have the same distribution, the random variables are also equally-distributed. Therefore, our strategy would be to calculate the expectation of these random variables and then to show, using Bernstein’s inequality, that the mean of these random variables does not deviate much from their expectation.

We start by calculating their expectation

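$$\mathbb{E}\left[\left(\rho(m_i^Tx)-\rho(m_i^Ty)\right)^2\right] = \mathbb{E}\left(\rho(m_i^Tx)\right)^2 + \mathbb{E}\left(\rho(m_i^Ty)\right)^2 - 2\,\mathbb{E}\,\rho(m_i^Tx)\rho(m_i^Ty). \tag{19}$$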

For calculating the first term at the right hand side (rhs) note that is a random Gaussian vector with variance . Therefore, from the symmetry of the Gaussian distribution we have that

$$\mathbb{E}\left(\rho(m_i^Tx)\right)^2 = \frac{1}{2}\mathbb{E}\left(m_i^Tx\right)^2 = \frac{1}{2m}\|x\|_2^2. \tag{20}$$

In the same way, . For calculating the third term at the rhs of (19), notice that is non-zero if both the inner product between and and the inner product between and are positive. Therefore, the smaller the angle between and , the higher the probability that both of them will have a positive inner product with . Using the fact that a Gaussian vector is uniformly distributed on the sphere, we can calculate the expectation of by the following integral, which is dependent on the angle between and :

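$$\mathbb{E}\left[\rho(m_i^Tx)\rho(m_i^Ty)\right] = \frac{\|x\|_2\|y\|_2}{m\pi}\int_0^{\pi-\angle(x,y)}\sin(\theta)\sin(\theta+\angle(x,y))\,d\theta = \frac{\|x\|_2\|y\|_2}{2m\pi}\Big(\sin(\angle(x,y))-\cos(\angle(x,y))\big(\angle(x,y)-\pi\big)\Big). \tag{21}$$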
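The second moment in (20) and the integral form of the cross term in (21) can be compared against a Monte Carlo simulation. The sketch below is only an illustration: the dimensions, the fixed angle of π/3 between the two points, their norms, and the number of samples are assumptions, the entries of each sampled row are taken to have variance 1/m as implied by (20), and the integral is evaluated with a simple midpoint rule rather than in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, num_samples = 10, 50, 400_000  # illustrative values (assumptions)

# Two points with a prescribed angle between them (assumption: pi / 3).
angle = np.pi / 3
x = np.zeros(n)
y = np.zeros(n)
x[0] = 2.0
y[0], y[1] = 1.5 * np.cos(angle), 1.5 * np.sin(angle)
norm_x, norm_y = np.linalg.norm(x), np.linalg.norm(y)

# Independent Gaussian rows with entry variance 1/m, followed by the ReLU.
rows = rng.standard_normal((num_samples, n)) / np.sqrt(m)
relu_x = np.maximum(rows @ x, 0.0)
relu_y = np.maximum(rows @ y, 0.0)

# Equation (20): second moment of a single ReLU output.
mc_second_moment = np.mean(relu_x**2)
formula_second_moment = norm_x**2 / (2 * m)

# Equation (21), integral form, evaluated with the midpoint rule.
num_bins = 100_000
width = (np.pi - angle) / num_bins
theta = (np.arange(num_bins) + 0.5) * width
integral = np.sum(np.sin(theta) * np.sin(theta + angle)) * width
mc_cross = np.mean(relu_x * relu_y)
formula_cross = norm_x * norm_y / (m * np.pi) * integral

print(mc_second_moment, formula_second_moment)
print(mc_cross, formula_cross)
```

Repeating the run for several angles shows the cross term decreasing as the angle grows, in line with the remark that a smaller angle makes it more likely that both inner products are positive.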

Having the expectation of all the terms in (19) calculated, we define the following random variable, which is the difference between and its expectation,

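$$z_i \triangleq \left(\rho(m_i^Tx)-\rho(m_i^Ty)\right)^2 - \frac{1}{m}\left(\frac{1}{2}\|x-y\|_2^2 + \frac{\|x\|_2\|y\|_2}{\pi}\Big(\sin(\angle(x,y))-\cos(\angle(x,y))\angle(x,y)\Big)\right). \tag{22}$$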

Clearly, the random variable is zero-mean. To finish the first step of the proof, it remains to show that the sum does not deviate much from zero (its mean). First, note that

$$P\left(\left|\sum_{i=1}^{m} z_i\right| > t\right) \le P\left(\sum_{i=1}^{m}|z_i| > t\right), \tag{23}$$

and therefore it is enough to bound the term on the rhs of (23). By Bernstein’s inequality, we have

$$P\left(\sum_{i=1}^{m}|z_i| > t\right) \le \exp\left(-\frac{t^2/2}{\sum_{i=1}^{m}\mathbb{E}z_i^2 + Mt/3}\right), \tag{24}$$

where is an upper bound on . To calculate , one needs to calculate the fourth moments and , which are easy to compute by using the symmetry of Gaussian vectors, and the correlations , and . For calculating the latter, we use as before the fact that a Gaussian vector is uniformly distributed on the sphere and calculate an integral on an interval which is dependent on the angle between and . For example,

$$\mathbb{E}\left[\rho(m_i^Tx)^3\rho(m_i^Ty)\right] = \frac{4\|x\|_2^3\|y\|_2}{m^2\pi}\int_0^{\pi-\angle(x,y)}\sin^3(\theta)\sin(\theta+\angle(x,y))\,d\theta, \tag{25}$$

where is the angle between and . We have a similar formula for the other terms. By simple arithmetic and using the fact that , we have that .

The type of formula in (25), which is similar to the one in (21), provides an insight into the role of training. As random layers ‘integrate uniformly’ on the interval , learning picks the angle that maximizes/minimizes the inner product based on whether and belong to the same class or to distinct classes.

Using Proposition 8, the fact that and the behavior of (see Fig. 2) together with the triangle inequality imply that with probability exceeding . Plugging this bound with the one we computed for into (24) leads to

 P(m∑i=1|zi|>δ2) ≤ exp(−mδ2/82.12+4δω(K)2/3+δ) +2exp(−ω(K)2/4),

where we included the probability of Proposition 8 in the above bound. Since by the assumption of the theorem , we can write

$$P\left(\sum_{i=1}^{m}|z_i| > \frac{\delta}{2}\right) \le C\exp\left(-\frac{m\delta^2}{4\omega(K)^2}\right). \tag{27}$$

Bound for all : Let be an cover for . By using a union bound we have that for every pair in ,

$$P\left(\sum_{i=1}^{m}|z_i| > \frac{\delta}{2}\right) \le C\,|N_\epsilon(K)|^2\exp\left(-\frac{m\delta^2}{4\omega(K)^2}\right). \tag{28}$$

By Sudakov’s inequality we have . Plugging this inequality into (28) leads to

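$$P\left(\sum_{i=1}^{m}|z_i| > \frac{\delta}{2}\right) \le C\exp\left(-\frac{m\delta^2}{4\omega(K)^2} + c\,\epsilon^{-2}\omega(K)^2\right). \tag{29}$$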

Setting , we have by the assumption that the term in the exponent at the rhs of (29) is negative and therefore the probability decays exponentially as increases.

Bound for all : Let us rewrite as and , with and , where . We get to the desired result by setting and using the triangle inequality combined with Proposition 8 to control and , the fact that and the Taylor expansions of the and functions to control the terms related to (see (5)).

## Appendix B Proof of Theorem 4

Proof: Instead of proving Theorem 4 directly, we deduce it from Theorem 3. First we notice that (4) is equivalent to

 ∣∣∣12∥ρ(Mx)∥22+12∥ρ(My)∥22−14∥x∥22−14∥y∥22 (30) ∥x∥2∥y∥22π(sin(∠(x,y))−cos(∠(x,y))∠(x,y)))∣∣∣≤δ2.

As , we also have that with high probability (like the one in Theorem 3),

$$\left|\|\rho(Mx)\|_2^2 - \frac{1}{2}\|x\|_2^2\right| \le \delta, \quad \forall x\in K. \tag{31}$$

(The proof is very similar to the one of Theorem 3). Applying the reverse triangle inequality to (30) and then using (31), followed by dividing both sides by , lead to

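$$\left|\frac{2\rho(Mx)^T\rho(My)}{\|x\|_2\|y\|_2} - \cos(\angle(x,y)) - \frac{1}{\pi}\Big(\sin(\angle(x,y))-\cos(\angle(x,y))\angle(x,y)\Big)\right| \le \frac{3\delta}{\|x\|_2\|y\|_2}. \tag{32}$$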

Using the reverse triangle inequality with (32) leads to

 (33) 1π(sin(∠(x,y))−cos(∠(x,y))∠(x,y)))∣∣∣≤3δ∥x∥2∥y∥2

To complete the proof it remains to bound the rhs of (33). For the second term in it, we have

 (34)

Because , it follows from (31) that

$$\|\rho(Mx)\|_2^2 \ge \frac{1}{2}\beta^2 - \delta. \tag{35}$$

Dividing by both sides of (31) and then using (35) and the fact that , provide

 ∣∣ ∣∣1−∥x∥2√2∥ρ(Mx)∥2∣∣ ∣∣ (36) ≤δ(∥ρ(Mx)∥2+1√2∥x∥2)∥ρ