# Stability for the Training of Deep Neural Networks and Other Classifiers

## Abstract

We examine the stability of loss-minimizing training processes that are used for deep neural networks (DNNs) and other classifiers. While a classifier is optimized during training through a so-called loss function, the performance of classifiers is usually evaluated by some measure of accuracy, such as the overall accuracy, which quantifies the proportion of objects that are well classified. This leads to the guiding question of stability: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability, and provide examples of instability. Our main result consists of two novel conditions on the classifier which, if either is satisfied, ensure stability of training; that is, we derive tight bounds on accuracy as loss decreases. We also derive a sufficient condition for stability on the training set alone, identifying flat portions of the data manifold as potential sources of instability. The latter condition is explicitly verifiable on the training dataset. Our results do not depend on the algorithm used for training, as long as loss decreases with training.

## 1 Introduction

Our purpose in the present article is to provide rigorous justifications for the basic training method commonly used in learning algorithms. We particularly focus on the stability of training in classification problems. Indeed, if training is unstable, it is difficult to decide when to stop. To better explain our aim, it is useful to introduce some basic notation.

We are given a set of objects whose elements are classified into a certain number $K$ of classes. We then introduce a function, called the exact classifier, that maps each object $s$ to the index $i(s)$ of its class. However, the exact classifier is typically known only on a finite subset $T$ of the set of objects, called the training set.

In practice, objects are classified by an approximate classifier, and the mathematical problem is to identify an optimal approximate classifier among a large set of potential candidates. The optimal approximate classifier should agree (or at least nearly agree) with the exact classifier on $T$. We consider here a type of approximate classifier called soft approximate classifiers, which may again be described in a general setting as functions

$$\phi : s \in T \longrightarrow \phi(s) = (p_1(s), \dots, p_K(s)) \in [0,1]^K, \quad \text{with } p_1(s) + \dots + p_K(s) = 1. \tag{1}$$

Such a classifier is often interpreted as giving the probabilities that $s$ belongs to a given class: $p_i(s)$ can be described as the predicted probability that $s$ is in class $i$. In this framework a perfect classifier on $T$ is a function $\phi$ such that $p_i(s) = 1$ if and only if $i = i(s)$ (and hence $p_i(s) = 0$ if $i \neq i(s)$).

In practice, of course, one cannot optimize among all possible functions $\phi$; instead, some sort of discretization is performed, which generates a parametrized family $\phi(\cdot, \alpha)$ where the parameters $\alpha$ can be chosen in some other high-dimensional space. Both the dimension of the objects and the dimension of the parameter space are typically very large numbers, and the relation between them plays a crucial role in the use of classifiers for practical problems.

Deep neural networks (DNNs) are of course one of the most popular methods for constructing such a parametrized family $\phi(\cdot, \alpha)$. If $\phi(\cdot, \alpha)$ is a DNN, it is a composition of several functions called layers. Each layer is itself a composition of one linear and one nonlinear operation. In this setting, the parameters $\alpha$ are the entries of the matrices used for the linear operations on each layer. Figure 1 shows a diagram of a DNN.

Such deep neural networks have demonstrated their effectiveness on a variety of clustering and classifying problems, such as the well-known [15] for handwriting recognition. To just give a few examples, one can refer to [13] for image classification, [12] on speech recognition, [16] for an example of applications to the life sciences, [19] on natural language understanding, or to [14] for a general presentation of deep neural networks.

The training process consists of optimizing the family $\phi(\cdot, \alpha)$ in $\alpha$ to obtain the “best” choice. This naturally leads to the question of what is meant by best, which is at the heart of our investigations in this article. Here, best means the highest performing classifier. There are indeed several ways to measure the performance of classifiers: one may first consider the overall accuracy, which corresponds to the proportion of objects that are “well-classified.” We say that $s$ is well-classified by $\phi(\cdot, \alpha)$ if $p_{i(s)}(s,\alpha) > \max_{i \neq i(s)} p_i(s,\alpha)$, that is, $\phi(\cdot, \alpha)$ gives the highest probability to the class to which $s$ actually belongs. Because this set will come up often in our analysis, we introduce a specific notation for this “good” set

$$G(\alpha) = \Big\{ s \in T : \; p_{i(s)}(s,\alpha) > \max_{1 \le i \le K,\; i \neq i(s)} p_i(s,\alpha) \Big\}, \tag{2}$$

which leads to the classical definition of overall accuracy

$$\mathrm{acc}(\alpha) = \frac{\# G(\alpha)}{\# T}. \tag{3}$$

For simplicity in this introduction, we assign the same weight to each object in the training set $T$. In practice, objects in $T$ are often assigned different weights, which will be introduced below.

Other measures of performance exist and are commonly used: average accuracy, where the average is taken within each class and which gives more weight to small classes, Cohen's $\kappa$ coefficient [6], etc.

However, in practice training algorithms do not optimize accuracy (whether overall accuracy or some other definition) but instead try to minimize some loss function. There are compelling reasons for not using the accuracy: for example, accuracy distinguishes between well-classified and not well-classified in a binary manner. That is, accuracy does not account for an object being close to well classified. Moreover, accuracy is a piecewise-constant function (taking values in $\{0, 1/\#T, 2/\#T, \dots, 1\}$), so its gradient is zero wherever it is defined. For these reasons, it cannot be maximized with gradient descent methods. Also, approximating the piecewise-constant accuracy function by a smooth function is computationally intensive, particularly due to the high dimensionality of its domain.

A typical example of a loss function is the so-called cross-entropy loss:

$$\bar L(\alpha) = -\frac{1}{\# T} \sum_{s \in T} \log\big(p_{i(s)}(s,\alpha)\big), \tag{4}$$

which is simply the average over the training set of the functions $-\log(p_{i(s)}(s,\alpha))$. Obviously each of these functions is minimized if $p_{i(s)}(s,\alpha) = 1$, i.e., if the classifier worked perfectly on the object $s$.

Minimizing (4) simply leads to finding the parameters $\alpha$ such that on average $p_{i(s)}(s,\alpha)$ is as large (as close to $1$) as possible. Since $\bar L$ is smooth in $\alpha$ (at least if $\phi$ is), one may apply classical gradient algorithms, with stochastic gradient descent (SGD) being among the most popular due to the linear structure of $\bar L$ and the large size of many training sets. Indeed, since (4) consists of many terms, computing the gradient of $\bar L$ is expensive. SGD simplifies this task by randomly selecting a batch of a few terms at each iteration of the descent algorithm, and computing the gradient of only those terms. This optimization process is referred to as training. Though in general loss decreases through training, the decrease need not be monotone because of, e.g., effects due to stochasticity. In this work, we assume for simplicity that loss decreases monotonically, which is approximately true in most practical problems.

• If the loss function converges to $0$, then the accuracy converges to $1$, as in that case all predicted probabilities $p_{i(s)}(s,\alpha)$ converge to $1$. Nevertheless, the training process is necessarily stopped at some time, before $\bar L$ reaches exactly $0$; see, e.g., [2]. A first question is therefore how close to $1$ the accuracy is when the loss function is very small.

• In practice, we may not be able to reach perfect accuracy (or loss) on every training set. This can be due to the large dimension of the objects or of the space of parameters $\alpha$, which makes it difficult to computationally find a perfect minimizer even if one exists, with the usual issue of local minimizers. Moreover, there may not even exist a perfect minimizer, due for instance to classification errors on the training set (some objects may have been assigned to the wrong class). As a consequence, a purely asymptotic comparison between loss and accuracy as $\bar L \to 0$ is not enough, and we need to ask how the loss function correlates with the accuracy away from $0$.

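• In general, there is no reason why decreasing $\bar L$ would increase the accuracy, which is illustrated by the following elementary counter-example. Consider a setting with at least three classes and an object $s$ which belongs to the first class. Suppose that, for the initial choice of parameters $\alpha_1$, the first class receives the highest predicted probability $p_1(s,\alpha_1)$, though only by a small margin. We may be given a next choice of parameters $\alpha_2$ for which $p_1(s,\alpha_2) > p_1(s,\alpha_1)$ but some other class receives an even higher probability. Then the object $s$ is well classified by the first choice $\alpha_1$ and it is not well classified by the second choice $\alpha_2$. Yet the loss function, which is simply $-\log p_1(s,\alpha)$ here, is lower for $\alpha_2$. Thus, while loss improves, accuracy worsens.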

• The previous example raises the key issue of stability during training, which roughly speaking means that accuracy increases with training. Indeed, one does not expect accuracy to increase monotonically during training, and the question becomes: what conditions would guarantee that accuracy increases during training? For example, one could require that the good set monotonically grows during training; in mathematical terms, that would mean that $G(\alpha_1) \subset G(\alpha_2)$ if $\alpha_2$ are parameters from a later stage of the training than $\alpha_1$. However, such a condition would be too rigid and likely counterproductive by preventing the training algorithms from reaching better classifiers. At the same time, wild fluctuations in accuracy or in the good set would destroy any realistic hope of a successful training process, i.e., finding a high-accuracy classifier.

• While the focus of this work is on the stability during training, another crucial question is the robustness of the trained classifier, which is the stability of identifying classes with respect to small perturbations of objects $s$, in particular for $s \notin T$. The issues of robustness and stability are connected. Lack of stability during the training can often lead to over-parametrization by extending the training process for too long. In turn, over-parametrization typically implies poor robustness outside of the training set. This is connected to the Lipschitz norm of the classifier and we refer, for example, to [1].

• Since the marker of progress during training is the decrease of the loss function, stability is directly connected to how the loss function correlates with the accuracy. Per the known counterexamples, such correlation cannot always exist. Thus the key question is to be able to identify which features of the dataset and of the classifier are critical to establish such correlations and therefore ensure stability.
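The elementary counter-example in the list above can be made concrete with a few lines of code. This is only an illustrative sketch: the probability values `p_first` and `p_second` are hypothetical numbers chosen for this illustration, not values taken from the text, and the helper names are likewise assumptions.

```python
import math

# Hypothetical predicted probabilities for one object whose true class is class 0.
p_first = [0.40, 0.30, 0.30]   # earlier parameters: class 0 has the highest probability
p_second = [0.45, 0.55, 0.00]  # later parameters: class 1 now has the highest probability


def loss(p, true_class=0):
    """Cross-entropy loss of a single object: minus the log of the true-class probability."""
    return -math.log(p[true_class])


def well_classified(p, true_class=0):
    """True if the true class receives strictly the highest probability."""
    return all(p[true_class] > q for j, q in enumerate(p) if j != true_class)


loss_first, loss_second = loss(p_first), loss(p_second)
print("first choice :", loss_first, well_classified(p_first))
print("second choice:", loss_second, well_classified(p_second))
```

With these assumed values the loss is lower for the second choice even though the object is no longer well classified, which is exactly the mismatch between loss and accuracy discussed above.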

Our main contributions are to bring rigorous mathematical answers to this last question, in the context of simple deep learning algorithms with the very popular SGD algorithm. While part of our approach would naturally extend to other settings, it is intrinsically dependent on the approach used to construct the classifier, which is described in section 2.1. More specifically, we proceed in the following two steps.

• We first identify conditions on the distribution of probabilities $p_i(s,\alpha)$ defined in (1) for each $s \in T$ which guarantee that $\bar L$ correlates with accuracy. Specifically, we show that under these conditions, loss is controlled by accuracy (vice-versa is trivial). At this stage such conditions necessarily mix, in a non-trivial manner, the statistical properties of the dataset with the properties of the classifier (its architecture and parameters), introduced in section 2.1. Since these conditions depend on the classifier parameters, which evolve with training, they cannot be verified before training starts, and they may depend on how the training proceeds.

• The second step is to disentangle the previous conditions to obtain separate conditions on the training set and on the neural network architecture and parameters. We are able to accomplish this for one of the conditions obtained in the first step. We provide an intuitive explanation of the main challenge here using DNN classifiers with only two classes as an example. The natural mathematical stability condition is on the distribution of the set of differences in probability for each $s \in T$, which directly measures how well classified an object is (how much more likely it is to have been assigned to the correct class). Our main stability condition prevents a cluster of very small probability differences in this set. However, a condition on the training dataset is far more feasible for practical use. The challenge is finding features of the data manifold which rule out small clusters of probability differences. The difficult part here lies in propagating the one-dimensional condition on probabilities backward through the DNN layers to a condition in the high-dimensional space of objects, and describing the features that may cause small clusters (see Remark 3.1). Specifically, if the data lies approximately on some manifold, flat portions of this manifold may lie parallel to the hyperplane kernel(s) of one or more of the linear maps of the DNN. In fact, only small clusters of probability differences prevent stability; therefore only small flat portions of the data manifold are bad for stability. The nonlinear activations of the DNN can also lead to small clusters, which can be explained as follows. Corners created through a transformation of large flat portions of the manifold by the nonlinearity of a piecewise linear activation function can convert a large flat portion into multiple small flat portions (e.g., converting a straight line into a jagged line), resulting in small clusters.

Our hope is that the present approach and results will help develop a better understanding of why learning algorithms perform so well in many cases but still fail in other settings. This is achieved by providing a framework to evaluate the suitability of training sets and of neural network construction for solving various classification problems. Rigorous analysis of neural networks has of course already started and several approaches that are different from the present one have been introduced. We mention in particular the analysis of neural nets in terms of multiscale contractions involving wavelets and scattering transforms; see for example [5, 17, 18] and [8] for scattering transforms. While there are a multitude of recent papers aimed to make neural net-based algorithms (also known as deep learning algorithms) faster, our goal is to help make such algorithms more stable.

We conclude by summarizing the practical outcomes of our work:

• First, we derive and justify explicitly verifiable conditions on the dataset that guarantee stability. We refer in particular to subsection 2.4 for a discussion of how to check our conditions in practice.

• Our analysis characterizes how the distribution of objects in the training set and the distribution of the output of the classifier for misclassified objects affect stability of training.

• Finally, among many possible future directions of research, our results suggest that the introduction of multiscale loss functions could significantly improve stability.

Acknowledgments. The work of L. Berlyand and C. A. Safsten was supported by NSF DMREF DMS-1628411, and the work of P.-E. Jabin was partially supported by NSF DMS Grant 161453, 1908739, NSF Grant RNMS (Ki-Net) 1107444, and LTS grant DO 14. The authors thank R. Creese and M. Potomkin for useful discussions.

## 2 Main results

### 2.1 Mathematical formulation of deep neural networks and stability

#### Classifiers

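A parameterized family of soft classifiers must map objects to a list of probabilities. To accomplish this, a classifier is a composition $\phi(\cdot,\alpha) = \rho \circ X(\cdot,\alpha)$, where $X(\cdot,\alpha)$ takes values in $\mathbb{R}^K$ and $\rho$ is the so-called softmax function defined by

$$\rho(x) = \rho(x_1, \cdots, x_K) = \left( \frac{e^{x_1}}{\sum_{k=1}^K e^{x_k}}, \cdots, \frac{e^{x_K}}{\sum_{k=1}^K e^{x_k}} \right). \tag{5}$$

Clearly, each component of $\rho(x)$ lies in $[0,1]$ and the components sum to $1$, so $\phi$ is a soft classifier no matter what function $X$ is used (though typically $X$ is differentiable almost everywhere). The form of the softmax function means that we can write the classifier as

$$\phi(s,\alpha) = \rho \circ X(s,\alpha) = \left( \frac{e^{X_1(s,\alpha)}}{\sum_{k=1}^K e^{X_k(s,\alpha)}}, \cdots, \frac{e^{X_K(s,\alpha)}}{\sum_{k=1}^K e^{X_k(s,\alpha)}} \right). \tag{6}$$

As in (1), denote $\phi(s,\alpha) = (p_1(s,\alpha), \dots, p_K(s,\alpha))$, where $p_i(s,\alpha)$ is the probability that $s$ belongs to class $i$ predicted by a classifier with parameters $\alpha$. A key property of the softmax function is that it is order preserving in the sense that if $x_i > x_j$, then $\rho_i(x) > \rho_j(x)$. Therefore, the predicted class of $s$ can be determined by $X(s,\alpha)$. We define the key evaluation that determines whether an object is well-classified or not, namely

$$\delta X(s,\alpha) = X_{i(s)}(s,\alpha) - \max_{j \neq i(s)} X_j(s,\alpha). \tag{7}$$

If $\delta X(s,\alpha) > 0$, then $X_{i(s)}(s,\alpha)$ is the largest component of $X(s,\alpha)$, which means that $p_{i(s)}(s,\alpha)$ is the largest probability given by $\phi(s,\alpha)$, and thus $s$ is classified correctly. Similarly, if $\delta X(s,\alpha) < 0$, $s$ is classified incorrectly.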
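As a small illustration of (5) and (7), the following sketch computes the softmax of a score vector and the margin $\delta X$ for a single object. The score vector `scores` is a hypothetical output $X(s,\alpha)$ with $K = 3$ made up for this illustration, and the function names are assumptions rather than notation from the text.

```python
import math


def softmax(x):
    """Softmax rho(x) as in (5); subtracting max(x) avoids overflow and leaves the result unchanged."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def delta_x(x, true_class):
    """Margin delta X as in (7): true-class score minus the largest other score."""
    others = [v for j, v in enumerate(x) if j != true_class]
    return x[true_class] - max(others)


scores = [1.2, -0.3, 0.8]  # hypothetical X(s, alpha) for K = 3 classes
probs = softmax(scores)

# Softmax is order preserving, so the sign of delta X decides correctness.
print(probs, delta_x(scores, 0), delta_x(scores, 2))
```

If the true class is class 0, the margin is positive and the largest probability sits on the true class; if the true class were class 2, the margin would be negative and the object would be misclassified.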

As described in Section 1, a classifier learns to solve the classification problem by training on a finite set $T$ where the correct classifications are known. Training is completed by minimizing a loss function which measures how far the classifier is from the exact classifier on $T$. While there are many types of loss functions, the cross-entropy loss introduced in (4) is very common, and it is the loss function we will consider in this work. The loss in (4) is the simple average of $-\log(p_{i(s)}(s,\alpha))$ over all $s \in T$, but there is no reason we cannot use a weighted average:

$$\bar L(\alpha) = -\sum_{s \in T} \nu(s) \log\big(p_{i(s)}(s,\alpha)\big), \tag{8}$$

where $\nu(s) \ge 0$ and $\sum_{s \in T} \nu(s) = 1$. Weights could be uniform, i.e., $\nu(s) = 1/\#T$ for all $s \in T$, or non-uniform if, e.g., some $s \in T$ are more important than others. We can also use $\nu$ to measure the size of a subset of $T$, e.g., if $A \subset T$, $\nu(A) = \sum_{s \in A} \nu(s)$. The quantity $\delta X$ defined above facilitates some convenient estimates on loss, which are shown in section 3.1.
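The weighted loss (8) can be sketched as follows. The scores, labels and weights below are hypothetical values for a toy training set of three objects with $K = 2$, and the function names are assumptions made for this illustration.

```python
import math


def log_prob_true(x, true_class):
    """log p_{i(s)} computed from the scores x via a stable log-sum-exp."""
    m = max(x)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return x[true_class] - log_sum_exp


def weighted_loss(scores, labels, weights):
    """Weighted cross-entropy loss as in (8); the weights are assumed to sum to 1."""
    return -sum(w * log_prob_true(x, i) for x, i, w in zip(scores, labels, weights))


# Hypothetical toy data: three objects, two classes.
scores = [[2.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
labels = [0, 1, 0]                       # the third object is misclassified
uniform = [1.0 / 3.0] * 3
heavier_on_error = [0.25, 0.25, 0.5]     # more weight on the misclassified object

print(weighted_loss(scores, labels, uniform))
print(weighted_loss(scores, labels, heavier_on_error))
```

With uniform weights this reduces to the simple average in (4); shifting weight toward the misclassified object increases the loss, as one would expect.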

#### Deep neural network structure

Deep neural networks (DNNs) are a diverse set of algorithms, with the classification problem being just one of many of their applications. In this article, however, we will restrict our attention to DNN classifiers. DNNs provide a useful parameterized family $X(\cdot,\alpha)$ which can be composed with the softmax function to form a classifier. The function $X(\cdot,\alpha)$ is a composition of several simpler functions:

$$X(\cdot,\alpha) = f_M(\cdot,\alpha_M) \circ f_{M-1}(\cdot,\alpha_{M-1}) \circ \cdots \circ f_1(\cdot,\alpha_1). \tag{9}$$

Each $f_m$ for $1 \le m \le M$ is a composition of an affine transformation and a nonlinear function. The nonlinear function is called an activation function. A typical example is the so-called rectified linear unit (ReLU), which is defined for any integer $N \ge 1$ by $\mathrm{ReLU}(x_1, \dots, x_N) = (\max\{0, x_1\}, \dots, \max\{0, x_N\})$. Another example is the componentwise absolute value, $\mathrm{abs}(x_1, \dots, x_N) = (|x_1|, \dots, |x_N|)$. The affine transformation depends on many parameters (e.g., matrix elements) which are denoted together as $\alpha_m$. The collection of all DNN parameters is denoted $\alpha = (\alpha_1, \dots, \alpha_M)$.
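The layered structure (9) can be sketched in a few lines. This is a minimal illustration under stated assumptions: each layer is stored as a hypothetical pair `(W, b)` playing the role of $\alpha_m$, the same activation is applied after every affine map (as in the description above), and the two-layer network and input are made-up values.

```python
def affine(W, b, x):
    """Affine transformation x -> W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def relu(x):
    """Componentwise rectified linear unit."""
    return [max(0.0, v) for v in x]


def absval(x):
    """Componentwise absolute value."""
    return [abs(v) for v in x]


def forward(x, layers, activation):
    """Apply f_1, then f_2, ..., then f_M, where f_m = activation(affine(...))."""
    for W, b in layers:
        x = activation(affine(W, b, x))
    return x


# Hypothetical two-layer network (M = 2) acting on a two-dimensional object.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),    # alpha_1
    ([[1.0, -1.0], [0.0, 2.0]], [0.5, 0.0]),   # alpha_2
]
x = [1.0, -2.0]

print(forward(x, layers, relu))
print(forward(x, layers, absval))
```

The two activations give different outputs on the same input because ReLU discards the negative coordinate after the first layer while the absolute value reflects it.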

Though we use DNN classifiers as a guiding example for this article, most results apply to classifiers of the form $\phi = \rho \circ X$ where $\rho$ is the softmax function, and $X(\cdot,\alpha)$ is any family of functions parameterized by $\alpha$. In this article, we will use the term classifier to refer to any composition $\rho \circ X$, while DNN classifier refers to a classifier where $X$ has the structure of a DNN.

#### Training, Accuracy, and Stability

Training a DNN is the process of minimizing loss. In practice, one randomly selects a starting parameter $\alpha(0)$, and then uses an iterative minimization algorithm such as gradient descent or stochastic gradient descent to find a minimizing $\alpha$. Whatever algorithm is used, the iteration calculates $\alpha(t+1)$ using $\alpha(t)$ for $t = 0, 1, 2, \dots$. Our results do not depend on which algorithm is used for training, but we will make the essential assumption that loss decreases with training, $\bar L(\alpha(t_1)) \ge \bar L(\alpha(t_2))$ for $t_1 < t_2$. Throughout this article, we will abuse notation slightly by writing $\bar L(t) = \bar L(\alpha(t))$.

Accuracy is simply the proportion of well-classified elements of the training set. Using $\delta X$ and the weights $\nu$, we can define a function that measures accuracy for all times during training:

$$\mathrm{acc}(t) = \nu\big(\{ s \in T : \delta X(s,\alpha(t)) > 0 \}\big). \tag{10}$$

We will find it useful to generalize the notion of accuracy. For instance, we may want to know how many $s \in T$ are not only well-classified, but are well-classified by some margin $\eta \ge 0$. We therefore define the good set of margin $\eta$ as

$$G_\eta(t) = \{ s \in T : \delta X(s,\alpha(t)) > \eta \}. \tag{11}$$

Observe that $\mathrm{acc}(t) = \nu(G_0(t))$. For large $\eta$, the good set comprises those elements of $T$ that are exceptionally well-classified by the DNN with parameter values $\alpha(t)$. We will also consider the bad set of margin $\eta$

$$B_{-\eta}(t) = \{ s \in T : \delta X(s,\alpha(t)) \le -\eta \}, \tag{12}$$

which are the elements that are misclassified with a margin of $\eta$ by the DNN with parameters $\alpha(t)$.
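The definitions (10)-(12) translate directly into code once the margins $\delta X(s,\alpha(t))$ are known. In the sketch below, objects are identified by their index, and the lists `margins` and `weights` are hypothetical values for a training set of five objects; the function names are assumptions made for this illustration.

```python
def good_set(margins, eta):
    """Indices of objects in G_eta: margin strictly larger than eta, as in (11)."""
    return {i for i, d in enumerate(margins) if d > eta}


def bad_set(margins, eta):
    """Indices of objects in B_{-eta}: margin at most -eta, as in (12)."""
    return {i for i, d in enumerate(margins) if d <= -eta}


def mass(weights, subset):
    """nu(A): total weight of a subset A of the training set."""
    return sum(weights[i] for i in subset)


def accuracy(margins, weights):
    """acc = nu(G_0), as in (10)."""
    return mass(weights, good_set(margins, 0.0))


# Hypothetical margins delta X(s, alpha(t)) for five objects with uniform weights.
margins = [2.5, 0.3, -0.2, 1.1, -1.4]
weights = [0.2] * 5

print(accuracy(margins, weights))
print(good_set(margins, 1.0), bad_set(margins, 1.0))
```

Increasing the margin `eta` can only shrink the good set, so `good_set(margins, 1.0)` is always a subset of `good_set(margins, 0.0)`.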

Stability is the idea that when, during training, accuracy becomes high enough, it remains high for all later times. Specifically, we will prove that under certain conditions, for all $\varepsilon > 0$, there exist $\delta > 0$ and $\eta > 0$ so that if at some time $t_0$, $\nu(G_\eta(t_0)) > 1 - \delta$, then at all later times $t \ge t_0$, $\mathrm{acc}(t) \ge 1 - \varepsilon$.

### 2.2 Preliminary remarks and examples

#### Relationship between accuracy and loss

Intuitively, accuracy and loss should be connected, i.e., as loss decreases, accuracy increases and vice versa. However, as we will see in examples below, this is not necessarily the case. Nevertheless, we can derive some elementary relations between the two. For instance, from equation (40), we may easily derive a bound on the good set $G_\eta(t)$ via $\bar L(t_0)$ for some $\eta \ge 0$ for all times $t \ge t_0$:

$$\nu(G_\eta(t)) \ge 1 - \frac{\bar L(t_0)}{\log(1 + e^{-\eta})} \ge 1 - 2 e^{\eta} \bar L(t_0), \tag{13}$$

and in particular,

$$\mathrm{acc}(t) = \nu(G_0(t)) \ge 1 - \frac{\bar L(t_0)}{\log 2}. \tag{14}$$

This shows that if loss is sufficiently small at time $t_0$, then accuracy will be high for all later times. But this is not the same as stability; stability means that if accuracy is sufficiently high at time $t_0$, then it will remain high for all $t \ge t_0$. To obtain stability from (13), we somehow need to guarantee that high accuracy at time $t_0$ implies low loss at time $t_0$.
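One way to see where bounds of the form (13)-(14) come from is the following short computation. It is a sketch reconstructed from the definitions (5)-(8) and the assumption that loss does not increase during training; equation (40) is not reproduced in this part of the text, so the authors' own argument may differ in its details. For every $s \in T$,

$$p_{i(s)}(s,\alpha) = \frac{1}{1 + \sum_{k \neq i(s)} e^{X_k(s,\alpha) - X_{i(s)}(s,\alpha)}} \le \frac{1}{1 + e^{-\delta X(s,\alpha)}}.$$

If $s \notin G_\eta(t)$, then $\delta X(s,\alpha(t)) \le \eta$ and hence $-\log p_{i(s)}(s,\alpha(t)) \ge \log(1 + e^{-\eta})$. Since every term in (8) is non-negative, keeping only the terms with $s \notin G_\eta(t)$ gives

$$\bar L(t) \ge \log(1 + e^{-\eta}) \, \big(1 - \nu(G_\eta(t))\big).$$

Combining this with $\bar L(t) \le \bar L(t_0)$ for $t \ge t_0$ yields the first inequality in (13). For $\eta \ge 0$, the elementary estimate $\log(1 + x) \ge x/2$ on $[0,1]$, applied with $x = e^{-\eta}$, gives $1/\log(1 + e^{-\eta}) \le 2 e^{\eta}$ and hence the second inequality. Taking $\eta = 0$ gives (14).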

###### Example 2.1.

This example will demonstrate instability in a soft classifier resulting from a small number of elements of the training set that are misclassified. Let $T$ be a training set with uniform weights, each of whose elements is classified into one of two classes. Suppose that at some time $t_0$, after some training, the parameters $\alpha(t_0)$ are such that most of the $\delta X(s,\alpha(t_0))$ values are positive, but a few are clustered at negative values. An example histogram of these values is shown in Figure 2a. The loss and accuracy can be calculated using (8) and (10) respectively. For the values in Figure 2a, the loss and accuracy are

$$\bar L(t_0) = 0.1845, \qquad \mathrm{acc}(t_0) = 0.95.$$

Suppose that at some later time $t_1 > t_0$, the $\delta X$ values are those shown in Figure 2b. Most values have improved from $t_0$ to $t_1$, but a few have worsened. We can again calculate loss and accuracy:

$$\bar L(t_1) = 0.1772, \qquad \mathrm{acc}(t_1) = 0.798.$$

Since $\bar L(t_1) < \bar L(t_0)$, this example satisfies the condition that loss must decrease during training. However, accuracy has fallen considerably. This indicates an unstable classifier. The instability arises because enough objects have sufficiently poor classifications that, by improving their classification (increasing $\delta X$), training can still decrease loss even if a few correctly classified objects become misclassified, decreasing accuracy.
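The mechanism in Example 2.1 can be reproduced with a toy computation. The margins below are hypothetical and are not the values behind Figure 2; they are chosen only to show that loss can decrease while accuracy drops. The sketch assumes two classes and uniform weights, in which case the loss of a single object equals $\log(1 + e^{-\delta X})$.

```python
import math


def two_class_loss(margins):
    """Uniform-weight cross-entropy for K = 2: the average of log(1 + exp(-delta X))."""
    return sum(math.log1p(math.exp(-d)) for d in margins) / len(margins)


def accuracy(margins):
    """Proportion of objects with a strictly positive margin."""
    return sum(1 for d in margins if d > 0) / len(margins)


# Hypothetical margins at an earlier time t0: a small, badly misclassified cluster.
margins_t0 = [2.0] * 950 + [-0.9] * 50
# Hypothetical margins at a later time t1: the cluster is larger but only barely misclassified.
margins_t1 = [3.5] * 800 + [-0.05] * 200

loss_t0, acc_t0 = two_class_loss(margins_t0), accuracy(margins_t0)
loss_t1, acc_t1 = two_class_loss(margins_t1), accuracy(margins_t1)
print(loss_t0, acc_t0)
print(loss_t1, acc_t1)
```

In this made-up configuration the loss is lower at the later time while accuracy falls from 0.95 to 0.8, and at both times every margin stays above $-1$.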

### 2.3 Main Results

As explained above, to establish stability, we must bound $\bar L(t_0)$ in terms of the accuracy at $t_0$. Assuming that accuracy at $t_0$ is high, we may separate the training set into a large good set $G_\eta(t_0)$ for some $\eta > 0$ and its small complement $G^C_\eta(t_0)$. Using (40), we may make the following estimate, details for which are found in Section 4.

$$\bar L(t_0) \le (K-1) e^{-\eta} + \sum_{s \in G^C_\eta(t_0)} \nu(s) \log\big(1 + (K-1) e^{-\delta X(s,\alpha(t_0))}\big). \tag{15}$$
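For orientation, an estimate of the form (15) can be recovered from the definitions alone. The following is a sketch based on (6)-(8), not a reproduction of the argument in Section 4. Since $X_k(s,\alpha) - X_{i(s)}(s,\alpha) \le -\delta X(s,\alpha)$ for every $k \neq i(s)$,

$$-\log p_{i(s)}(s,\alpha) = \log\Big(1 + \sum_{k \neq i(s)} e^{X_k(s,\alpha) - X_{i(s)}(s,\alpha)}\Big) \le \log\big(1 + (K-1) e^{-\delta X(s,\alpha)}\big).$$

For $s \in G_\eta(t_0)$ we have $\delta X(s,\alpha(t_0)) > \eta$, so the right-hand side is at most $\log(1 + (K-1) e^{-\eta}) \le (K-1) e^{-\eta}$. Splitting the sum in (8) into $G_\eta(t_0)$ and its complement, and using $\nu(G_\eta(t_0)) \le 1$, gives the two terms of (15).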

The first term in (15) is controlled by $\eta$. If $\eta$ is even moderately large, then the second term dominates (15), with most of the loss coming from a few $s \in G^C_\eta(t_0)$. It is therefore sufficient to control the distribution of $\delta X(s,\alpha(t_0))$ for $s \in G^C_\eta(t_0)$. There are two primary reasons why this distribution may lead to large $\bar L(t_0)$, both relating to the smallest values of $\delta X$:

1. There is a single $s \in T$ for which $\delta X(s,\alpha(t_0))$ is very large and negative.

2. The minimum of $\delta X(s,\alpha(t_0))$ is not far from zero, but there are many values of $\delta X$ near it.

To obtain a good bound on $\bar L(t_0)$, we must address both issues:

1. Assume that $\delta X(s,\alpha(t_0)) > -1$ for all $s \in T$, i.e., the bad set $B_{-1}(t_0)$ is empty.

2. Impose a condition that prevents $\delta X$ from concentrating near its minimum value. Such conditions are called small mass conditions since they only concern $\delta X(s,\alpha(t_0))$ for $s$ in the small mass of objects in $G^C_\eta(t_0)$. Example 2.1 illustrates this issue.

In the following subsections, we introduce two small mass conditions A and B that guarantee stability. We show that each condition leads to small loss and ultimately to stability. Condition A leads to the tightest stability result, but it must be verified for all times $t \ge t_0$, whereas condition B is verified only at the initial time $t_0$. While conditions A and B are quite precise mathematical conditions, they do not directly impose conditions on the given data set $T$, with which one usually deals in practical problems. That is why in section 2.3.2, we introduce the so-called no-small-isolated data clusters (NSIDC) condition on the data set, which is sufficient for condition B but does not depend on the network parameters $\alpha$. Therefore the stability result under the NSIDC condition does not depend on these parameters, unlike stability under condition B. Even though this condition leads to a less tight stability result than condition A, it is preferable in practice. In fact, a further relaxation of the NSIDC condition that is easy to verify, and that only makes stability highly likely rather than guaranteed, may be a good future direction.

#### Condition on data and DNN for stability.

The first small mass condition we will consider ensures that the distribution of $\delta X$ values decays very quickly near the minimum value, so there cannot be a concentration of $\delta X$ values near it; as a result, high accuracy at $t_0$ implies low loss at $t_0$. Additionally, by applying this condition at all times, not just at $t_0$, we can improve the estimate (13).

###### Definition 2.1.

The set satisfies condition A at time if there exist constants , , , and so that for and all ,

 ν({s∈T:δX(s,α(t))

How precisely this condition limits small clusters is presented in Section 4.1, but a brief description of the role of each constant is given here:

• , the most important constant in (16), controls the degree to which -values can concentrate near . In particular, smaller is, the less concentrated the values are, leading to a better stability result.

• controls how quickly the distribution of values decays to zero near . The faster the decay, the larger may be, and the better for stability. Calculations are greatly simplified if .

• controls how fast the distribution decays near , leading to a bound on the size of for and . A simple calculation shows that for (16) to hold, we require .

• accounts for the fact that the training data is a discrete set. In (16), if and for some small , then it is possible that

 {s∈T:δX(s,α(t))

with both sets containing a single object . Furthermore,

 {s∈T:δX(s,α(t))

Thus, taking , we would require to satisfy (16). Such a large means that the stability result will be rather weak. In this sense, (16) identifies the individual point as a very small cluster. Subtracting from the left side of the inequality allows the inequality to hold with much smaller , accommodating a discrete dataset. Typically, we choose equal to about the mass of 5-10 elements of the training set. Though the presence of a small does not substantially affect stability, it does complicate the proofs. Therefore, in Theorem 2.1, we assume , which corresponds to the limit .

With condition A in hand, we can state the first stability result. Proofs and supporting lemmas are left for Section 3.

###### Theorem 2.1.

Suppose that with weights is a training set for a classifier such that which satisfies condition A for some constants , , , and . for all . Then for every there exist such that if good and bad sets at satisfy

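$$\nu(G_\eta(t_0)) > 1 - \delta \quad \text{and} \quad B_{-1}(t_0) = \emptyset, \tag{19}$$

then for all $t \ge t_0$,

$$\mathrm{acc}(t) = \nu(G_0(t)) \ge 1 - \varepsilon, \tag{20}$$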

and

 ν(Gη∗(t))>1−ε−(3/4)−log(3Λ¯L(t0))2η∗ (21)

for all : .

###### Remark 2.1.

The conclusion of Theorem 2.1 depends on the hypothesis that condition A holds. Short of brute force calculation on a case-by-case basis, there is at present no way to determine whether condition A holds for a given training set $T$ and parameter values $\alpha$, and for which constants it might hold. Furthermore, Theorem 2.1 does not control the dynamics of training with sufficient precision to guarantee that if condition A holds at $t_0$ then it will also hold for all $t \ge t_0$. Therefore, we have to further strengthen the hypotheses by insisting that condition A holds for all $t \ge t_0$.

###### Remark 2.2.

Though a more general version of Theorem 2.1 can be proved for and , both the statement and proof are much more tractable with and .

###### Remark 2.3.

For Theorem 2.1, it is sufficient to choose

 δ=12Λandη=max⎧⎪⎨⎪⎩1,⎛⎝1ϕlog⎛⎝(10(K−1)log2)ϕ3Λε⎞⎠⎞⎠2,log(30Λ(K−1)log2)2⎫⎪⎬⎪⎭. (22)

#### Doubling conditions on data independent of DNN to ensure stability

Our second condition will also limit the number of $\delta X$ values that may cluster together. It does this by ensuring that if a few $\delta X$ values are clustered near some point, then there must be more values that are larger than the values in the cluster. This means that the cluster is not isolated, that is, not all of the $\delta X$ values can lie near that point. Observe that this condition is on both the data and the DNN. In this section we show that this second condition follows from a so-called doubling condition on the dataset only, which is obviously advantageous in applications.

Before rigorously introducing the so-called doubling condition on data that ensures stability of training, we provide its heuristic motivation. Consider a set of data points and a bounded domain $W$ containing some of them. Rescale $W$ by a factor of $2$ to obtain the “doubled” domain $2W$. If the number of data points in $2W$ is the same as in $W$, then the data points in $W$ form a small cluster confined in $W$. In contrast, if $2W$ contains more points than $W$, then the cluster of points in $W$ is not isolated, which can be formulated as the following no-small-cluster (NSC) condition:

$$\#\{\text{points in } 2W\} \ge (1 + \sigma) \times \#\{\text{points in } W\} \quad \text{for some } \sigma > 0. \tag{23}$$

See Figure 3 for an illustration of how the doubling condition detects clusters.
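The heuristic (23) is easy to test numerically on a toy dataset. In the sketch below the domain $W$ is taken to be a ball, and $2W$ the ball with the same center and twice the radius; this choice of domain, the point sets and the value of `sigma` are assumptions made purely for illustration.

```python
import math


def count_in_ball(points, center, radius):
    """Number of points within the given distance of the center."""
    return sum(1 for p in points if math.dist(p, center) <= radius)


def satisfies_nsc(points, center, radius, sigma):
    """Check (23) for W = ball(center, radius) and 2W = ball(center, 2 * radius)."""
    inner = count_in_ball(points, center, radius)
    outer = count_in_ball(points, center, 2 * radius)
    return outer >= (1 + sigma) * inner


# Hypothetical isolated cluster: doubling W picks up no new points.
isolated = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (6.0, 5.0)]
# Hypothetical spread-out data: doubling W picks up additional points.
spread = [(0.0, 0.0), (0.15, 0.0), (0.0, 0.15), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3)]

print(satisfies_nsc(isolated, (0.0, 0.0), 0.2, sigma=0.5))
print(satisfies_nsc(spread, (0.0, 0.0), 0.2, sigma=0.5))
```

The first dataset fails the check because the three points near the origin form an isolated cluster, while the second passes because the doubled ball contains noticeably more points than the original one.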

This condition will be applied both to the $\delta X$ values and to the data points in $T$. It is more practical to check this condition on the dataset, which is why we explain here the underlying heuristics for the condition on the dataset. Consider a 2D ellipse translated along a straight line to obtain an infinite cylinder with elliptical cross section. Next, introduce the truncated cylinder $E$ by intersecting it with a half-space in 3D (e.g., all points satisfying a single linear inequality). Then the toy version of the NSC condition for $E$ is

$$\nu(\kappa \varepsilon E) \ge (1 + \sigma)\, \nu(\varepsilon E), \tag{24}$$

where doubling has been replaced with an arbitrary rescaling parameter $\kappa$. Finally, $\varepsilon$ ensures that the domain $\varepsilon E$ is sufficiently small, because some features of the data manifold are seen only at small scales, and allowing large scales would rule out datasets which are fine for stability. For example, a DNN may map a flat portion of the data manifold to a point due to degeneracy of its linear maps, leading to a small cluster. Therefore small $\varepsilon$ is needed to resolve non-flatness on small scales when it may appear as flatness on a large scale, e.g., a sine curve from far away looks flat.

We will present two doubling conditions which lead to stability. The first is a doubling condition on the set of $\delta X$ values in $\mathbb{R}$. The other is a doubling condition on the training data $T$. Both of the following two definitions provide precise formulations of the doubling condition (23) and its generalization (24).

###### Definition 2.2.

The set satisfies condition B at time if there exists a mass and constants , and such that for all , there exists and for all intervals around with ,

 (25)

where $\kappa I$ is the interval with the same center as $I$ but whose width is multiplied by $\kappa$.

Condition B comes with a notable advantage: it is a consequence of a similar condition on the training set called the no small data clusters condition. For this, we define truncated slabs as the intersection of a slab in some direction $u$ and centered at $s$ with a finite number of linear constraints

$$\mathrm{Sl} = \mathrm{Sl}(u, s, v_1, t_1, \dots, v_k, t_k) = \{ x : \; |(x - s) \cdot u| \le 1 \text{ and } v_i \cdot x \le t_i \;\; \forall i = 1, \dots, k \}. \tag{26}$$

We also consider the rescaled slab, which is obtained by changing the width of $\mathrm{Sl}$ by a factor $\kappa$,

$$\kappa\, \mathrm{Sl} = \{ x : \; |(x - s) \cdot u| \le \kappa \text{ and } v_i \cdot x \le t_i \;\; \forall i = 1, \dots, k \}. \tag{27}$$

We can now state our no small data clusters condition.

###### Definition 2.3.

Let be the extension of the measure from to by for all . The no small isolated data clusters condition holds if there exists , and so that for each truncated slab , that is, each choice of , , , , there exists and for any then

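$$\bar\nu(\kappa \varepsilon\, \mathrm{Sl}) \ge \min\big\{ \delta, \; \max\{ m_0, (1 + \sigma)\, \bar\nu(\varepsilon\, \mathrm{Sl}) \} - m_0 \big\}. \tag{28}$$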
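To make (26)-(28) concrete, the following sketch evaluates the inequality (28) for one truncated slab and one value of $\varepsilon$ on a toy dataset. It is only an illustration: the definition quantifies over all truncated slabs and all sufficiently small $\varepsilon$, which this check does not attempt, and the points, weights and constants are hypothetical. The rescaled slab of width `w` is taken to be the set where $|(x - s) \cdot u| \le w$, following (27).

```python
def dot(a, b):
    """Euclidean inner product of two vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def in_slab(x, s, u, width, constraints):
    """Membership in the truncated slab of the given width, following (26)-(27)."""
    offset = [xi - si for xi, si in zip(x, s)]
    return abs(dot(offset, u)) <= width and all(dot(v, x) <= t for v, t in constraints)


def slab_mass(points, weights, s, u, width, constraints):
    """Total weight of the data points lying in the truncated slab."""
    return sum(w for p, w in zip(points, weights) if in_slab(p, s, u, width, constraints))


def check_28(points, weights, s, u, constraints, eps, kappa, sigma, delta, m0):
    """Evaluate inequality (28) for a single slab and a single eps."""
    inner = slab_mass(points, weights, s, u, eps, constraints)
    outer = slab_mass(points, weights, s, u, kappa * eps, constraints)
    return outer >= min(delta, max(m0, (1 + sigma) * inner) - m0)


# Hypothetical data: four points in the plane with uniform weights.
points = [(0.0, 0.0), (0.05, 1.0), (0.3, -1.0), (2.0, 0.0)]
weights = [0.25] * 4
s, u = (0.0, 0.0), (1.0, 0.0)

print(check_28(points, weights, s, u, [], eps=0.1, kappa=4, sigma=0.4, delta=0.9, m0=0.0))
print(check_28(points, weights, s, u, [], eps=0.1, kappa=4, sigma=1.5, delta=0.9, m0=0.0))
```

For this made-up slab the inequality holds with the smaller growth factor and fails with the larger one, which shows how `sigma` quantifies how much extra mass the enlarged slab must capture.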

We can now present our second stability theorem:

###### Theorem 2.2.

Assume that the set satisfies condition B at as given by definition (2.2) at some time . Then for every there exists a constant such that if

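$$m_0 \le \frac{\varepsilon}{C}, \qquad \log\frac{1}{\varepsilon} + C \le \eta \le I_0, \qquad \delta_0 \le C\,\varepsilon\,\eta^{\log(1+\sigma)/\log\kappa}, \tag{29}$$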

and if good and bad sets at satisfy

$$\nu(G_\eta(t_0)) > 1 - \delta_0 \quad\text{and}\quad B_{-1}(t_0) = \emptyset, \tag{30}$$

then for all ,

$$\mathrm{acc}(t) = \nu(G_0(t)) \ge 1 - \varepsilon, \tag{31}$$

and

$$\nu(G_{\eta^*}(t)) \ge 1 - \frac{2}{\log 2}\,\varepsilon\, e^{\eta^*}, \tag{32}$$

for all .

The following theorem guarantees that if the training set satisfies the no small data clusters condition, then the set satisfies condition B, no matter what the values of the parameters are. Theorem 2.3 is useful because it allows us to establish stability based solely on the training set , without considering training. Theorems 2.1 and 2.2 apply to general classifiers, unlike the following Theorem 2.3, which only applies to classifiers consisting of a DNN with softmax as its last layer.

###### Theorem 2.3.

Let be a training set for a DNN with weights and whose activation functions are the absolute value function. For all , , and , there exists , , and so that if the no small isolated data clusters condition (28) holds on with constants , , , and , then condition B given by (25) holds on with constants , , and for any .

The proof of Theorem 2.3 actually propagates a more general condition on the layers that is worth stating. Instead of truncated slabs, it applies to more general truncated ellipsoidal cylinders, namely

$$E = E(Q, s, v_1, t_1, \dots, v_k, t_k) = \{x :\ (x-s)^T Q\,(x-s) \le 1 \ \text{and}\ v_i\cdot x \le t_i\ \ \forall i = 1,\dots,k\}, \tag{33}$$

for some symmetric positive semi-definite matrix . The dilated cylinders by some factor are obtained with

$$\kappa E = \{x :\ (x-s)^T Q\,(x-s) \le \kappa^2 \ \text{and}\ v_i\cdot x \le t_i\ \ \forall i = 1,\dots,k\}. \tag{34}$$

Of course if has rank then the definitions (33)-(34) exactly correspond to (26)-(27).

In this more general context, the extended no small data clusters condition on any measure reads for given , , , , and ,

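$$\begin{aligned} &\forall Q \in M_d(\mathbb{R}^d)\ \text{with}\ \operatorname{rank} Q \le r,\ \ \forall v_1,\dots,v_k \in \mathbb{R}^d\setminus\{0\},\ \ \forall s \in \mathbb{R}^d,\ \ \forall t_1,\dots,t_k \in \mathbb{R},\ \ \exists\,\varepsilon_0\ \text{s.t.}\ \forall \varepsilon \le \varepsilon_0,\\ &\mu\bigl(\kappa\varepsilon\,E(Q,s,v_1,t_1,\dots,v_k,t_k)\bigr) \ge \min\bigl\{\delta,\ \max\{m_0,\ (1+\sigma)\,\mu\bigl(\varepsilon\,E(Q,s,v_1,t_1,\dots,v_k,t_k)\bigr)\} - m_0\bigr\}. \end{aligned} \tag{35}$$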

The key role played by truncated ellipsoids is worth noting as they satisfy two important conditions:

• The image of a truncated ellipsoid by any linear map is another truncated ellipsoid;

• The inverse image of a truncated ellipsoid by the absolute value non-linear map consists exactly of two other truncated ellipsoids.

It is those properties that allow the propagation of the no small data clusters condition through the layers.
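To make the second property concrete, here is a one-dimensional worked example. It is only an illustration under simplifying assumptions (dimension one, no truncation constraints on the initial set, and a center $s = 3$ chosen for convenience), not the general statement used in the proof. Take

$$E = \{x \in \mathbb{R} :\ (x - 3)^2 \le 1\} = [2, 4].$$

Its inverse image under $y \mapsto |y|$ is

$$\{y :\ (|y| - 3)^2 \le 1\} = \{y :\ (y - 3)^2 \le 1 \ \text{and}\ -y \le 0\} \ \cup\ \{y :\ (y + 3)^2 \le 1 \ \text{and}\ y \le 0\} = [2, 4] \cup [-4, -2].$$

Each piece has the form (33) with $Q = 1$, center $\pm 3$, and a single linear constraint $v \cdot y \le 0$ with $v = \mp 1$. The linear constraints are thus generated by the absolute value itself, which suggests why the condition is formulated for truncated sets rather than for plain ellipsoids.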

Theorem 3.6 applies to DNNs using absolute value as their activation function because (i) the absolute value function is finite-to-one unlike, e.g., ReLU, and (ii) the absolute value function is piecewise linear unlike, e.g., sigmoid. Property (i) is important because an infinite-to-one function will map large portions of the training set to a point, resulting in a small cluster. It is reasonable to conjecture that Theorem 2.3 extends to any finite-to-one piecewise linear activation function. It is even likely that a more restrictive version of the no-small-isolated data clusters condition will allow ReLU to be used in Theorem 2.3. This is the subject of future work. Additionally, (ii) is helpful because having piecewise linear activations means that each map (from (9)) passing from one DNN layer to the next is also piecewise linear, so propagating the condition described by (28) through the DNN is more tractable than if each layer were fully nonlinear.

#### Examples of applications of Conditions A and B

Revisiting Example 2.1. Conditions A and B and their associated Theorems 2.1 and 2.2 guarantee stability, so why does stability fail in Example 2.1? The answer lies in the constants found in conditions A and B. Condition A is satisfied relative to constants , , , and , while condition B is satisfied relative to constants , , , and . We will determine for which constants conditions A and B are satisfied. For both conditions, we will consider a typical value of .

Theorem 2.1 requires . Taking , we find via brute force calculation that for the values given in example must exceed . In the proof of Theorem 2.1, we will see that and must be chosen so that However, in Example 2.1,

$$3\Lambda \bar L(t_0) \ge 3 \cdot 18 \cdot 0.1845 > 1.$$

Therefore, though Theorem 2.1 guarantees the existence of small enough and large enough to get stability, such and for Example 2.1 will not satisfy the primary hypothesis of the theorem, that is .

Theorem 2.2 requires . If , then , so either , which is not allowed in condition B, or . But if , then since , it is impossible to obtain for any .

Since Example 2.1 can only satisfy conditions A and B with constants that are either too large or too small, we cannot apply the stability theorems to it.

To see how Theorems 2.1 and 2.2 can be applied, consider the following example.

###### Example 2.2.

Suppose a classifier properly classifies all elements in its training set at . In fact, for some large , . How high will accuracy be at later times?

First suppose that is a two-class training set for a classifier which satisfies condition A for all for some constants , , and . Theorem 2.1 tells us that given , if is large enough, then

 acc(t)>1−ε

for all . But how small can be? Remark 2.3 gives a relationship between and , and in particular, provided is sufficiently large, we may choose

 ε=(10log2)ϕ3Λeϕ√η.

Since is large, is small. Therefore, at all later times, we are guaranteed high accuracy. Additionally, good sets also remain large. For example, by choosing

 η∗=log(3/4)2logϵ(η−log(3Λ))

a simple computation with (21) using shows that

 ν(Gη∗(t))>1−2ε.

for all . Therefore, set remains large for all , the price paid being that . If is very large, we can in fact note that .

Alternatively, choosing

 η∗=log3/42log2(η−log(3Λ))

gives

$$\nu(G_{\eta^*}(t)) \ge \frac{1}{2} - \varepsilon,$$

meaning the median of the distribution of values is greater than with now .

Similarly, if condition B is satisfied at for some constants , , and , instead of condition A, then we can apply Theorem 2.2. Using (29), we conclude that letting

$$\varepsilon = \max\{m_0 C,\ e^{C - \eta}\}$$

we have

 acc(t)>1−ε.

Since is small and is large, is also small, so accuracy remains high. By letting , (32), we have

 ν(Gη∗(t))>1−2ε,

so this good set also remains large but with a significantly worse than for condition A.

Finally, if , then

 ν(Gη∗(t))>1/2,

so again the median of the distribution is greater than . Nevertheless, here we still have that .

### 2.4 How to verify condition A and the no-small-isolated data clusters condition for a given dataset

In this section, we discuss how to verify condition A and the no-small-isolated data clusters condition to ensure stability of training algorithms in a real-world setting. For completeness, we review notations:

• We consider classifiers of the form where is the softmax function and depends on parameters where is the present iteration of the training process.

• is a training set for the classifier containing objects . For each , is the index of the correct class of .

• Each object has a positive weight with .

• .
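The quantities reviewed above can be computed directly from the classifier's outputs before the softmax. The sketch below is illustrative only: it assumes that the value attached to an object is the margin between the output for its correct class and the largest output among the other classes, and that a good set collects the objects whose margin strictly exceeds a threshold, which is consistent with the identity between accuracy and the mass of the good set at threshold zero in (31). The function names and the array layout are hypothetical.

```python
import numpy as np

def margins(logits, labels):
    """Margin of each object: correct-class output minus best other output.

    logits: array of shape (n_objects, n_classes), outputs before softmax.
    labels: integer array of shape (n_objects,), index of the correct class.
    """
    logits = np.array(logits, dtype=float)  # work on a float copy
    rows = np.arange(logits.shape[0])
    correct = logits[rows, labels]
    logits[rows, labels] = -np.inf          # mask out the correct class
    return correct - logits.max(axis=1)

def good_set_mass(delta, weights, eta):
    """Total weight of objects whose margin is strictly above eta.

    Assumption: strict inequality; with eta = 0 this is the accuracy.
    """
    return weights[delta > eta].sum()
```

With margins in hand, the accuracy at the current iteration is `good_set_mass(delta, weights, 0.0)`, and the mass of larger-margin good sets is obtained by raising the threshold.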

Now we will explain how to verify condition A and the no-small-isolated data clusters (NSIDC) condition, and how to use them to guarantee stability.

• Condition A. Train the classifier until a time when a reasonable degree of accuracy is achieved. Calculate the values of for each . One way to do this is to choose the typical values , , and so that the only constant to solve for in (16) is . To this end, make the observation:

 minimal Λ such that(???) is satisfied=maxx1∈{β,0},x2>x1ν({s∈T:δX(s,α(t0))

Therefore, finding the optimal is a matter of solving a maximization problem. For fixed , the right hand side of (36) is a piecewise constant function with discontinuities at for . Thus, the maximization problem is solved by sampling the right hand side of (36) at and , and for all and then finding the maximum of the resulting list of samples.

With condition A satisfied for some known constants, one may apply Theorem 2.1. However, as seen with Example 2.1, if is too large, Theorem 2.1 may still require larger than it actually is. It may be possible to decrease by repeating the maximization process with smaller or larger . If an acceptable is found, Theorem 2.1 guarantees stability. If not, one may need to train longer to find an acceptable .

• NSIDC condition. Assume that the classifier is a DNN with the absolute value function as its activation function. By verifying the no-small-isolated data clusters condition, we are guaranteed that condition B holds for all time. Therefore, we need only verify once that the no-small-isolated data clusters condition holds, which is its principal advantage over condition A. To verify this condition, choose constants , , and . Let be the set of all truncated ellipsoidal cylinders in such that and . We are left with finding the constant which satisfies (28) for all truncated ellipsoidal cylinders in :

$$\text{maximal } \sigma' \text{ such that (28) is satisfied} = \min_{E \in P} \frac{\nu(\kappa E \cap T)}{\nu(E \cap T)} - 1. \tag{37}$$

As in verifying condition A, we will solve this minimization problem by discretizing the domain of minimization, , and sampling the objective function only on that discretization. It should be noted that the dimension of is of order . Since (the dimension of the space containing ) is typically high, this means that a sufficiently fine discretization is necessarily quite large.
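The discretized minimization can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure used in this paper: it samples only rank-one cylinders (slabs) without truncation constraints, centers each slab on a training point so that its mass is positive, and draws the half-width from a user-supplied list. The name `sampled_sigma` and all of its parameters are hypothetical, and the constraints defining the admissible set of cylinders are not enforced.

```python
import numpy as np

def sampled_sigma(points, weights, kappa, widths, n_samples=1000, seed=0):
    """Minimum of nu(kappa E) / nu(E) - 1 over randomly sampled slabs E.

    points:  array of shape (n, d), the training set.
    weights: positive array of shape (n,), the mass of each point.
    widths:  candidate half-widths of the slabs (hypothetical choice).
    """
    rng = np.random.default_rng(seed)
    n, d = points.shape
    best = np.inf
    for _ in range(n_samples):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)            # random unit direction
        s = points[rng.integers(n)]       # center on a data point: nu(E) > 0
        h = rng.choice(widths)            # half-width of the slab
        proj = np.abs((points - s) @ u)   # distance to the central hyperplane
        small = weights[proj <= h].sum()
        large = weights[proj <= kappa * h].sum()
        best = min(best, large / small - 1.0)
    return best
```

Because the minimum is taken over a finite sample rather than over all admissible cylinders, the returned value can only overestimate the true minimum in (37). A small sampled value is therefore evidence against the condition, while a large one does not certify it.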

After finding , we may be sure that condition B is satisfied at every iteration of the training algorithm for known constants. Therefore, we may