Stability for the Training of Deep Neural Networks and Other Classifiers
Abstract
We examine the stability of loss-minimizing training processes that are used for deep neural networks (DNNs) and other classifiers. While a classifier is optimized during training through a so-called loss function, the performance of classifiers is usually evaluated by some measure of accuracy, such as the overall accuracy, which quantifies the proportion of objects that are well classified. This leads to the guiding question of stability: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability and provide examples of instability. Our main result consists of two novel conditions on the classifier which, if either is satisfied, ensure stability of training; that is, we derive tight bounds on accuracy as loss decreases. These conditions are explicitly verifiable in practice on a given dataset. Our results do not depend on the algorithm used for training, as long as loss decreases with training.
1 Introduction
Our purpose in the present article is to provide rigorous justification for the basic training method commonly used in learning algorithms. We focus in particular on classification problems; to better explain our aim, it is useful to introduce some basic notation.
We are given a set of objects whose elements are classified in a certain number of classes. Then we introduce a function called the exact classifier that maps each to the index of its class. However, the exact classifier is typically only known on a finite subset of called the training set.
In practice, objects in are classified by an approximate classifier, and the mathematical problem is to identify an optimal approximate classifier among a large set of potential candidates. The optimal approximate classifier should agree (or, at least nearly agree) with the exact classifier on . We consider here a type of approximate classifier called soft approximate classifiers which may again be described in a general setting as functions
(1) 
Such a classifier is often interpreted as giving the probabilities that belongs to a given class: can be described as the predicted probability that is in class . In this framework a perfect classifier on is a function s.t. if and only if (and hence if ).
In practice of course, one cannot optimize among all possible functions and instead some sort of discretization is performed which generates a parametrized family where the parameters can be chosen in some other large dimensional space, with for instance. Both and are typically very large numbers, and the relation between them plays a crucial role in the use of classifiers for practical problems.
Deep neural networks (DNNs) are of course one of the most popular examples of methods to construct such a parametrized family . In those settings is obtained by applying a sequence of linear and nonlinear operations on the initial object , one linear and nonlinear operation per layer in the network. In this setting, the parameters are entries of matrices used for the linear operations on each layer.
Such deep neural networks have demonstrated their effectiveness on a variety of clustering and classifying problems, such as the well-known [14] for handwriting recognition. To give just a few examples, one can refer to [12] for image classification, [11] on speech recognition, [15] for an example of applications to the life sciences, [18] on natural language understanding, or to [13] for a general presentation of deep neural networks.
The training process consists of optimizing the parameters $\alpha$ in the family $\phi(\cdot,\alpha)$ to obtain the “best” choice. This naturally leads to the question of what is meant by best, which is at the heart of our investigations in this article. Here, best means the highest performing classifier. There are indeed several ways to measure the performance of classifiers: one may first consider the overall accuracy, which corresponds to the proportion of objects that are “well-classified.” We say that an object $s$ of the training set $T$ is well-classified by $\phi(\cdot,\alpha)$ if $p_{i(s)}(s,\alpha) > \max_{i\neq i(s)} p_i(s,\alpha)$, that is, $\phi(\cdot,\alpha)$ gives the highest probability to the class $i(s)$ to which $s$ actually belongs. Because this set will come up often in our analysis, we introduce a specific notation for this “good” set
(2) $G(\alpha) = \{\, s\in T \,:\, p_{i(s)}(s,\alpha) > \max_{i\neq i(s)} p_i(s,\alpha) \,\}$
which leads to the classical definition of overall accuracy
(3) $\mathrm{acc}(\alpha) = \dfrac{\# G(\alpha)}{\# T}$
For simplicity in this introduction, we assign the same weight to each object in the training set . In practice, objects in are often assigned different weights, which will be introduced below.
Other measures of performance exist and are commonly used: average accuracy, where the average is taken within each class and which gives more weight to small classes, Cohen's $\kappa$ coefficient [5], etc.
However, in practice training algorithms do not optimize accuracy (whether overall accuracy or some other definition) but instead try to minimize some loss function. There are compelling reasons for not using the accuracy: for example, accuracy distinguishes between well-classified and not well-classified in a binary manner. That is, accuracy does not account for an object close to being well classified. Moreover, accuracy is a piecewise-constant function (taking values ) so that its gradient is zero wherever it is defined. For these reasons, it cannot be maximized with gradient descent methods. An approximation of this piecewise-constant function is computationally intensive, particularly due to the high dimensionality of its domain.
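A typical example of a loss function is the so-called cross-entropy loss:
(4) $\bar L(\alpha) = -\dfrac{1}{\# T}\sum_{s\in T}\log p_{i(s)}(s,\alpha)$
which is simply the average over the training set $T$ of the functions $-\log p_{i(s)}(s,\alpha)$, where $i(s)$ is the correct class of the object $s$ and $p_{i(s)}(s,\alpha)$ is the probability that the classifier with parameters $\alpha$ assigns to that class. Obviously each of these functions is minimized if $p_{i(s)}(s,\alpha)=1$, i.e. if the classifier worked perfectly on the object $s$.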
Minimizing (4) simply leads to finding the parameters such that on average is as large (as close to ) as possible. Since is smooth in (at least if is), one may apply classical gradient algorithms, with stochastic gradient descent (SGD) being among the most popular due to the linear structure of and the large size of many training sets. Indeed, since (4) consists of many terms, computing the gradient of is expensive. SGD simplifies this task by randomly selecting a batch of a few terms at each iteration of the descent algorithm, and computing the gradient of only those terms. This optimization process is referred to as training. Though in general loss decreases through training, it need not be monotone because of e.g. effects due to stochasticity. In this work, we assume for simplicity that loss decreases monotonically, which is approximately true in most practical problems.
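To make the training step described above concrete, the following is a minimal NumPy sketch of mini-batch SGD on the cross-entropy loss. It is an illustration under stated assumptions, not the setting analyzed in this article: the classifier is a single linear map followed by softmax, and the data `S`, the rule generating `labels`, the batch size, and the learning rate are all hypothetical choices.

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, S, labels):
    # Average over all objects of -log(probability assigned to the correct class).
    p = softmax(S @ W)
    return -np.mean(np.log(p[np.arange(len(S)), labels]))

def sgd_step(W, S, labels, batch_size, lr, rng):
    # Select a random batch and use the gradient of the loss on that batch only.
    idx = rng.choice(len(S), size=batch_size, replace=False)
    p = softmax(S[idx] @ W)
    p[np.arange(batch_size), labels[idx]] -= 1.0  # gradient w.r.t. the pre-softmax outputs
    grad = S[idx].T @ p / batch_size
    return W - lr * grad

rng = np.random.default_rng(0)
S = rng.normal(size=(200, 5))              # hypothetical objects in R^5
labels = np.argmax(S[:, :3], axis=1)       # hypothetical exact classifier with 3 classes
W = np.zeros((5, 3))                       # hypothetical linear model (assumption)

initial_loss = cross_entropy(W, S, labels)
for _ in range(100):
    W = sgd_step(W, S, labels, batch_size=20, lr=0.1, rng=rng)
final_loss = cross_entropy(W, S, labels)
print(initial_loss, final_loss)
```

Each step touches only 20 of the 200 terms in the sum, which is why the loss on the full training set need not decrease at every single iteration even though it decreases overall.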
The main question that we aim to answer in this article is why decreasing the loss function should improve accuracy. We start by pointing out the following observations which explain why the answer is not straightforward.

If the loss function converges to $0$, then the accuracy converges to $100\%$, as in that case all predicted probabilities of the correct classes converge to $1$. Nevertheless the training process is necessarily stopped at some time, before the loss reaches exactly $0$, see e.g. [2]. A first question is therefore how close to $100\%$ the accuracy is when the loss function is very small.

In practice, we may not be able to reach perfect accuracy (or loss) on every training set. This can be due to the large dimension, of the objects or of the space of parameters , which makes it difficult to computationally find a perfect minimizer even if one exists, with the usual issue of local minimizers. Moreover, there may not even exist a perfect minimizer, due for instance to classification errors on the training set (some objects may have been assigned to the wrong class). As a consequence, a purely asymptotic comparison between loss and accuracy as is not enough, and we need to ask how the loss function correlates with the accuracy away from .

In general, there is no reason why decreasing would increase the accuracy, which is illustrated by the following elementary counterexample. Consider a setting with classes and an object which belongs to the first class and such that for the initial choice of parameter : . We may be given a next choice of parameter such that: . Then the object is well classified by the first choice of and it is not well classified by the second choice . Yet the loss function, which is simply here, is obviously lower for . Thus, while loss improves, accuracy worsens.

The previous example raises the key issue of stability during training, which roughly speaking means that accuracy increases with training. Indeed, one does not expect accuracy to increase monotonically during training, and the question becomes what conditions would guarantee that accuracy increases during training? For example, one could require that the good set monotonically grows during training; in mathematical terms, that would mean that if are parameters from a later stage of the training. However, such a condition would be too rigid and likely counterproductive by preventing the training algorithms from reaching better classifiers. At the same time, wild fluctuations in accuracy or in the good set would destroy any realistic hope of a successful training process, i.e., finding a high-accuracy classifier.

While the focus of this work is on the stability during training, another crucial question is the robustness of the trained classifier, which is the stability of identifying classes with respect to small perturbations of objects , in particular . The issues of robustness and stability are connected. Lack of stability during the training can often lead to overparametrization by extending the training process for too long. In turn overparametrization typically implies poor robustness outside of the training set. This is connected to the Lipschitz norm of the classifier and we refer, for example, to [1].

Since the marker of progress during training is the decrease of the loss function, stability is directly connected to how the loss function correlates with the accuracy. Per the known counterexamples, such correlation cannot always exist. Thus the key question is to be able to identify which features of the dataset and of the classifier are critical to establish such correlations and therefore ensure stability.
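The elementary counterexample in the list above can be checked numerically. The sketch below uses three classes and hypothetical probability vectors chosen only for illustration (they are not necessarily the values used by the authors): the object belongs to the first class, the first parameter choice classifies it correctly, and the second choice lowers its loss while misclassifying it.

```python
import math

def cross_entropy(p, true_class):
    # Loss contribution of one object: -log of the probability of its true class.
    return -math.log(p[true_class])

def well_classified(p, true_class):
    # True when the true class receives strictly the highest probability.
    return all(p[true_class] > q for k, q in enumerate(p) if k != true_class)

true_class = 0                  # the object belongs to the first class
p_first = (0.40, 0.30, 0.30)    # hypothetical probabilities, first parameter choice
p_second = (0.45, 0.50, 0.05)   # hypothetical probabilities, second parameter choice

for name, p in (("first", p_first), ("second", p_second)):
    print(name, cross_entropy(p, true_class), well_classified(p, true_class))
```

The loss only looks at the probability of the true class, so it cannot see that a competing class has overtaken it; this is the mechanism behind the loss of accuracy.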
Our main contributions are to bring rigorous mathematical answers to this last question, in the context of simple deep learning algorithms with the very popular SGD algorithm. While part of our approach would naturally extend to other settings, it is intrinsically dependent on the approach used to construct the classifier, which is described in section 2.1. More specifically, we proceed in the following two steps.

We first identify conditions on the distribution of probabilities defined in (1) for each which guarantee that correlates with accuracy. Specifically, we show that under these conditions, loss is controlled by accuracy (the converse is trivial). At this stage such conditions necessarily mix, in a nontrivial manner, the statistical properties of the dataset with the properties of the neural network (its architecture and parameters), introduced in section 2.1. Since these conditions depend on the network parameters which evolve with training, they cannot be verified before training starts, and they may depend on how the training proceeds.

The second step is to disentangle the previous conditions to obtain separate conditions on the training set and on the neural network architecture and parameters. We are able to accomplish this for one of the conditions obtained in the first step, which is well suited to the combination of linear operations and activation function on the layer, defined in section 2.1. The main idea here is to be able to propagate backward on the neural network the required distribution of , see Remark 3.1.
Our hope is that the present approach and results will help develop a better understanding of why learning algorithms perform so well in many cases but still fail in other settings. This is achieved by providing a framework to evaluate the suitability of training sets and of neural network construction for solving various classification problems. Rigorous analysis of neural networks has of course already started and several approaches that are different from the present one have been introduced. We mention in particular the analysis of neural nets in terms of multiscale contractions involving wavelets and scattering transforms; see for example [4, 16, 17] and [7] for scattering transforms. While there are a multitude of recent papers aiming to make neural net-based algorithms (also known as deep learning algorithms) faster, our goal is to help make such algorithms more stable.
We conclude by summarizing the practical outcomes of our work:

First, we derive and justify explicitly verifiable conditions on the dataset that guarantee stability. We refer in particular to subsection 2.5 for a discussion of how to check our conditions in practice.

Our analysis characterizes how the distribution of objects in the training set and the distribution of the output of the classifier for misclassified objects affect stability of training.

Finally, among many possible future directions of research, our results suggest that the introduction of multiscale loss functions could significantly improve stability.
Acknowledgments. The work of L. Berlyand and C. A. Safsten was supported by NSF DMREF DMS1628411, and the work of P.E. Jabin was partially supported by NSF DMS Grant 161453, 1908739, NSF Grant RNMS (KiNet) 1107444, and LTS grant DO 14. The authors thank R. Creese and M. Potomkin for useful discussions.
2 Main results
2.1 Mathematical formulation of deep neural networks and stability
2.2 Classifiers
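A parameterized family of soft classifiers $\phi(\cdot,\alpha)$ must map objects $s$ to a list of probabilities. To accomplish this, a classifier is a composition $\phi(\cdot,\alpha) = \rho\circ X(\cdot,\alpha)$, where $X(\cdot,\alpha)$ maps objects to $\mathbb{R}^K$, with $K$ the number of classes, and $\rho$ is the so-called softmax function defined by
(5) $\rho(x_1,\dots,x_K) = \left(\dfrac{e^{x_1}}{\sum_{k=1}^K e^{x_k}},\;\dots,\;\dfrac{e^{x_K}}{\sum_{k=1}^K e^{x_k}}\right)$
Clearly, $0 < \rho_i(x) < 1$ and $\sum_{i=1}^K \rho_i(x) = 1$, so $\phi$ is a soft classifier no matter what function $X$ is used (though typically $X$ is differentiable almost everywhere). The form of the softmax function means that we can write the classifier as
(6) $\phi(s,\alpha) = \left(\dfrac{e^{X_1(s,\alpha)}}{\sum_{k=1}^K e^{X_k(s,\alpha)}},\;\dots,\;\dfrac{e^{X_K(s,\alpha)}}{\sum_{k=1}^K e^{X_k(s,\alpha)}}\right)$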
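As in (1), denote $\phi(s,\alpha) = (p_1(s,\alpha),\dots,p_K(s,\alpha))$, where $p_i(s,\alpha)$ is the probability that $s$ belongs to class $i$ predicted by a classifier with parameters $\alpha$. A key property of the softmax function is that it is order preserving in the sense that if $X_i(s,\alpha) > X_j(s,\alpha)$, then $p_i(s,\alpha) > p_j(s,\alpha)$. Therefore, the predicted class of $s$ can be determined by $X(s,\alpha)$. We define the key evaluation that determines whether an object is well-classified or not, namely
(7) $\delta X(s,\alpha) = X_{i(s)}(s,\alpha) - \max_{j\neq i(s)} X_j(s,\alpha)$
If $\delta X(s,\alpha) > 0$, then $X_{i(s)}(s,\alpha)$ is the largest component of $X(s,\alpha)$, which means that $p_{i(s)}(s,\alpha)$ is the largest probability given by $\phi(s,\alpha)$, and thus, $s$ is classified correctly. Similarly, if $\delta X(s,\alpha) < 0$, $s$ is classified incorrectly.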
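The two ingredients of this subsection, the softmax function and the gap between the correct-class output and the largest competing output, can be sketched in a few lines of NumPy. The function name `delta_X` and the sample vector `x` are illustrative assumptions; `x` plays the role of the pre-softmax output of a classifier on one object.

```python
import numpy as np

def softmax(x):
    # Shifting by the max does not change the result and avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def delta_X(x, true_class):
    # Correct-class output minus the largest of the other outputs.
    others = np.delete(x, true_class)
    return x[true_class] - others.max()

x = np.array([2.0, 0.5, -1.0])   # hypothetical pre-softmax outputs for K = 3 classes
p = softmax(x)

print(p, p.sum())
print(delta_X(x, 0))   # positive: class 0 would be classified correctly
print(delta_X(x, 1))   # negative: class 1 would be classified incorrectly
```

Because softmax preserves order, the sign of this gap decides correctness without ever computing the probabilities.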
As described in Section 1, a classifier learns to solve the classification problem by training on a finite set $T$ where the correct classifications are known. Training is completed by minimizing a loss function which measures how far the classifier is from the exact classifier on $T$. While there are many types of loss functions, the cross-entropy loss introduced in (4) is very common, and it is the loss function we will consider in this work. The loss in (4) is the simple average of $-\log p_{i(s)}(s,\alpha)$ over all $s\in T$, but there is no reason we cannot use a weighted average:
(8) $\bar L(\alpha) = -\sum_{s\in T}\mu(s)\log p_{i(s)}(s,\alpha)$
where $\mu(s) > 0$ and $\sum_{s\in T}\mu(s) = 1$. Weights can be uniform, i.e., $\mu(s) = 1/\# T$ for all $s\in T$, or non-uniform if, e.g., some objects are more important than others. We can also use $\mu$ to measure the size of a subset of $T$: if $A\subset T$, then $\mu(A) = \sum_{s\in A}\mu(s)$. The quantity $\delta X$ defined in (7) facilitates some convenient estimates on loss, which are shown in Section 3.1.
Deep neural network structure
Deep neural networks (DNNs) are a diverse set of algorithms with the classification problem being just one of many of their applications. In this article, however, we will restrict our attention to DNN classifiers. DNNs provide a useful parameterized family which can be composed with the softmax function to form a classifier. The function is a composition of several simpler functions:
Each for is a composition of an affine transformation and a nonlinear function. The nonlinear function is called an activation function. A typical example is the so-called rectified linear unit (ReLU), which is defined for any integer $N\geq 1$ by $\mathrm{ReLU}(x_1,\dots,x_N) = (\max\{0,x_1\},\dots,\max\{0,x_N\})$. Another example is the componentwise absolute value, $\mathrm{abs}(x_1,\dots,x_N) = (|x_1|,\dots,|x_N|)$. The affine transformation depends on many parameters (e.g., matrix elements) which are denoted together as . The collection of all DNN parameters is denoted .
Though we use DNN classifiers as a guiding example for this article, most results apply to classifiers of the form where is the softmax function, and is any family of functions parameterized by . In this article, we will use the term classifier to refer to any composition , while DNN classifier refers to a classifier where has the structure of a DNN.
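The layered structure described above can be sketched as follows. This is a minimal illustration, assuming every layer (including the last) applies an affine map followed by the activation, and using small random matrices as hypothetical parameters; the layer sizes and the helper names `forward` and `classify` are not taken from the article.

```python
import numpy as np

def relu(x):
    # Componentwise rectified linear unit.
    return np.maximum(x, 0.0)

def forward(s, layers, activation=relu):
    # Apply each layer in turn: affine transformation, then activation.
    x = s
    for A, b in layers:
        x = activation(A @ x + b)
    return x

def classify(s, layers, activation=relu):
    # Compose the network with softmax to obtain class probabilities.
    x = forward(s, layers, activation)
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
layers = [                                    # hypothetical parameters: 4 -> 6 -> 3
    (rng.normal(size=(6, 4)), rng.normal(size=6)),
    (rng.normal(size=(3, 6)), rng.normal(size=3)),
]
s = rng.normal(size=4)                        # hypothetical object in R^4

p_relu = classify(s, layers)                  # ReLU activation
p_abs = classify(s, layers, activation=np.abs)  # componentwise absolute value
print(p_relu, p_abs)
```

Passing the activation as an argument makes it easy to switch to the absolute value function, which is the activation assumed in the result that transfers a condition on the data to a condition on the network output.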
Training, Accuracy, and Stability
Training a DNN is the process of minimizing loss. In practice, one randomly selects a starting parameter , and then uses an iterative minimization algorithm such as gradient descent or stochastic gradient descent to find a minimizing . Whatever algorithm is used, the iteration calculates using for . Our results do not depend on which algorithm is used for training, but will make the essential assumption that loss decreases with training, for . Throughout this article, we will abuse notation slightly by writing .
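Accuracy is simply the proportion of well-classified elements of the training set. Using $\delta X$ from (7) and the weights $\mu(s)$ from (8), with $\mu(A)$ denoting the total weight of a subset $A\subset T$, we can define a function that measures accuracy for all times during training:
(9) $\mathrm{acc}(t) = \mu\left(\{\, s\in T : \delta X(s,\alpha(t)) > 0 \,\}\right)$
We will find it useful to generalize the notion of accuracy. For instance, we may want to know how many $s\in T$ are not only well-classified, but are well-classified by some margin $\eta\geq 0$. We therefore define the good set of margin $\eta$ as
(10) $G_\eta(t) = \{\, s\in T : \delta X(s,\alpha(t)) > \eta \,\}$
Observe that $\mathrm{acc}(t) = \mu(G_0(t))$. For large $\eta$, the good set comprises those elements of $T$ that are exceptionally well-classified by the DNN with parameter values $\alpha(t)$. We will also consider the bad set of margin $\eta$
(11) $B_{-\eta}(t) = \{\, s\in T : \delta X(s,\alpha(t)) < -\eta \,\}$
which are the elements that are misclassified with a margin of $\eta$ by the DNN with parameters $\alpha(t)$.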
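Accuracy and the masses of the good and bad sets are all weighted counts of objects whose correct-class gap lies above or below a threshold, so they can be computed with one helper. The sketch below assumes strict inequalities in the definitions of the sets and uses hypothetical gap values and uniform weights.

```python
import numpy as np

def mass(mask, weights):
    # Total weight of the objects selected by a boolean mask.
    return float(weights[mask].sum())

delta = np.array([3.0, 1.2, 0.4, -0.1, -2.5])  # hypothetical gaps, one per object
weights = np.full(delta.size, 1.0 / delta.size)  # uniform weights summing to 1
eta = 1.0                                        # hypothetical margin

accuracy = mass(delta > 0, weights)      # weight of well-classified objects
good_mass = mass(delta > eta, weights)   # well-classified with margin eta
bad_mass = mass(delta < -eta, weights)   # misclassified with margin eta
print(accuracy, good_mass, bad_mass)
```

With these values three of the five objects are well-classified, two of them by a margin of at least 1, and one object is misclassified by that margin.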
Stability is the idea that when, during training, accuracy becomes high enough, it remains high for all later times. Specifically, we will prove that under certain conditions, for all , there exists and so that if at some time , , then at all later times, .
2.3 Preliminary remarks and examples
Relationship between accuracy and loss
Intuitively, accuracy and loss should be connected, i.e., as loss decreases, accuracy increases and vice versa. However, as we will see in examples below, this is not necessarily the case. Nevertheless, we can derive some elementary relations between the two. For instance, from equation (30), we may easily derive a bound on the good set via for some for all times :
(12) 
and in particular,
(13) 
This shows that if loss is sufficiently small at time , then accuracy will be high for all later times. But this is not the same as stability; stability means that if accuracy is sufficiently high at time , then it will remain high for all . To obtain stability from (12), we somehow need to guarantee that high accuracy at time implies low loss at time .
Example 2.1.
This example will demonstrate instability in a soft classifier resulting from a small number of elements of the training set that are misclassified. Let be a training set with elements with uniform weights, each classified into one of two classes. Suppose that at some time , after some training, the parameters are such that most of the values are positive, but a few are clustered near . An example histogram of these values is shown in Figure 1a. The loss and accuracy can be calculated using (8) and (9) respectively. For the values in Figure 1a, the loss and accuracy are
Suppose that at some later time , the values are those shown in Figure 1b. Most values have improved from to , but a few have worsened. We can again calculate loss and accuracy:
Since , this example satisfies the condition that loss must decrease during training. However, accuracy has fallen considerably. This indicates an unstable classifier. The instability arises because enough objects have sufficiently poor classifications that by improving their classification (increasing ), training can still decrease loss if a few correctly classified objects become misclassified, decreasing accuracy.
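The mechanism of Example 2.1 can be reproduced with synthetic numbers. The sketch below does not use the values from Figure 1, which are not reproduced here; it assumes two classes, 100 objects with uniform weights, and hypothetical gap values chosen to show the effect. For two classes the cross-entropy loss of an object with gap d equals log(1 + exp(-d)).

```python
import numpy as np

def loss(delta):
    # Two-class cross-entropy with uniform weights: mean of log(1 + exp(-gap)).
    return float(np.mean(np.log1p(np.exp(-delta))))

def accuracy(delta):
    # Fraction of objects whose gap is positive.
    return float(np.mean(delta > 0))

# Earlier time: 90 objects barely correct, 10 objects badly misclassified.
delta_t0 = np.concatenate([np.full(90, 0.5), np.full(10, -5.0)])

# Later time: the 10 bad objects improve a lot (but stay misclassified),
# 60 objects improve slightly, and 30 previously correct objects become misclassified.
delta_t1 = np.concatenate([np.full(60, 1.0), np.full(30, -0.2), np.full(10, -1.0)])

print(loss(delta_t0), accuracy(delta_t0))
print(loss(delta_t1), accuracy(delta_t1))
```

The large reduction in loss contributed by the ten badly misclassified objects pays for the thirty objects that cross over to the wrong side, so loss falls while accuracy drops from 0.9 to 0.6.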
2.4 Main Results
As explained above, to establish stability, we must bound in terms of the accuracy at . Assuming that accuracy at is high, we may separate the training set into a large good set for some and its small complement . Using (30), we may make the following estimate, details of which are found in Section 4.
(14) 
The first term in (14) is controlled by . If is even moderately large, then the second term dominates (14), with most of the loss coming from a few . It is therefore sufficient to control the distribution of . There are two primary reasons why this distribution may lead to large , both relating to :

There is a single that is very large and negative, that is, .

is not far from zero, but there are many near .
To obtain a good bound on , we must address both issues:

Assume that for all , i.e., the bad set is empty.

Impose a condition that prevents from concentrating near . Such conditions are called small mass conditions since they only concern for in the small mass of objects in . Example 2.1 illustrates this issue.
In the following subsections, we introduce two small mass conditions that lead to stability, and discuss the advantages of each. We will show that each condition leads to small loss and ultimately to stability.
First result
The first small mass condition we will consider ensures that the distribution of values decays very quickly near , the minimum value, so there cannot be a concentration of near , resulting in high accuracy at implying low loss at . Additionally, by applying this condition at all times, not just at , we can improve the estimate (12). How precisely this condition accomplishes these dual purposes is presented in Section 4.1.
Definition 2.1.
The set satisfies condition A at time if there exist constants , , , and so that for and all ,
(15) 
With condition A in hand, we can state the first stability result. Proofs and supporting lemmas are left for Section 3.
Theorem 2.1.
Suppose that with weights is a training set for a classifier such that which satisfies condition A for some constants , , , and for all . Then for every there exist such that if the good and bad sets at satisfy
(16) 
then for all ,
(17) 
and
(18) 
for all : .
Remark 2.1.
The conclusion of Theorem 2.1 depends on the hypothesis that condition A holds. Short of brute force calculation on a case-by-case basis, there is at present no way to determine whether condition A holds for a given training set and parameter values , and for which constants , , , and it might hold. Furthermore, Theorem 2.1 does not control the dynamics of training with sufficient precision to guarantee that if condition A holds at then it will also hold for all . Therefore, we have to further strengthen the hypotheses by insisting that condition A holds for all .
Remark 2.2.
Though a more general version of Theorem 2.1 can be proved for and , both the statement and proof are much more tractable with and .
Remark 2.3.
For Theorem 2.1, it is sufficient to choose
(19) 
Unconditional result
Our second condition will also limit the number of values that may cluster near . It does this by ensuring that if a few are clustered near , then there must be more values that are larger than the values in the cluster. This means that not all of the can have near .
Definition 2.2.
The set satisfies condition B at time if there exist a mass and constants , and such that for all intervals ,
(20) 
where is the interval with the same center as but whose width is multiplied by .
Condition B comes with a notable advantage: it is a consequence of a similar condition on the training set called the no small isolated data clusters condition:
Definition 2.3.
Let be the extension of the measure from to by for all . The no small isolated data clusters condition holds if there exists , and so that for each slab in ,
(21) 
where is the slab with the same center as , but whose width is multiplied by .
We can now present our second stability theorem:
Theorem 2.2.
Assume that the set satisfies condition B as given by Definition 2.2 at some time . Then for every there exists a constant such that if
(22) 
and if good and bad sets at satisfy
(23) 
then for all ,
(24) 
and
(25) 
for all .
The following theorem guarantees that if the training set satisfies the no small isolated data clusters condition, then the set satisfies condition B, no matter what the values of the parameters are. Theorem 2.3 is useful because it allows us to establish stability based solely on the training set , without needing to begin training. Unlike Theorems 2.1 and 2.2, Theorem 2.3 only applies to classifiers consisting of a DNN with softmax as its last layer.
Theorem 2.3.
Let be a training set for a DNN with weights and whose activation functions are the absolute value function. For all , and , there exists , , and so that if the no small isolated data clusters condition holds on with constants , , and , then condition B holds on with constants , , and for any .
Examples of applications of Condition A and B
Revisiting Example 2.1. Conditions A and B and their associated Theorems 2.1 and 2.2 guarantee stability, so why does stability fail in Example 2.1? The answer lies in the constants found in conditions A and B. Condition A is satisfied relative to constants , , , and , while condition B is satisfied relative to constants , , , and . We will determine for which constants conditions A and B are satisfied. For both conditions, we will consider a typical value of .
Theorem 2.1 requires . Taking , we find via brute force calculation that for the values given in example must exceed . In the proof of Theorem 2.1, we will see that and must be chosen so that However, in Example 2.1,
Therefore, though Theorem 2.1 guarantees the existence of small enough and large enough to get stability, such and for Example 2.1 will not satisfy the primary hypothesis of the theorem, that is .
Theorem 2.2 requires . If , then , so either , which is not allowed in condition B, or . But if , then since , it is impossible to obtain for any .
Since Example 2.1 can only satisfy conditions A and B with constants that are either too large or too small, we cannot apply the stability theorems to it.
Example 2.2.
Suppose a classifier properly classifies all elements in its training set at . In fact, for some large , . How high will accuracy be at later times?
First suppose that is a two-class training set for a classifier which satisfies condition A for all for some constants , , and . Theorem 2.1 tells us that given , if is large enough, then
for all . But how small can be? Remark 2.3 gives a relationship between and , and in particular, provided is sufficiently large, we may choose
Since is large, is small. Therefore, at all later times, we are guaranteed high accuracy. Additionally, good sets also remain large. For example, by choosing
A simple computation with (18) using shows that
for all . Therefore, set remains large for all , the price paid being that . If is very large, we can in fact note that .
Alternatively, we may choose
gives
meaning the median of the distribution of values is greater than with now .
Similarly, if condition B is satisfied at for some constants , , and , instead of condition A, then we can apply Theorem 2.2. Using (22), we conclude that letting
we have
Since is small and is large, is also small, so accuracy remains high. By letting , (25), we have
so this good set also remains large but with a significantly worse than for condition A.
Finally, if , then
so again the median of the distribution is greater than . Nevertheless, here we still have that .
2.5 How to verify conditions A and B for a given dataset
In this section, we discuss how to verify conditions A and B and the no small isolated data clusters condition to ensure stability of training algorithms in a real-world setting. For completeness, we review notations:

We consider classifiers of the form where is the softmax function and depends on parameters where is the present iteration of the training process.

is a training set for the classifier containing objects . For each , is the index of the correct class of .

Each object has a positive weight with .

.
Now we will explain how to verify each of the conditions, and how to use them to guarantee stability.

Condition A. Train the classifier until a time when a reasonable degree of accuracy is achieved. Calculate the values of for each . One way to do this is to choose the typical values , , and so that the only constant to solve for in (15) is . To this end, make the observation:
(26) 
Therefore, finding the optimal is a matter of solving a maximization problem. For fixed , the right-hand side of (26) is piecewise constant with discontinuities at for . Thus, the maximization problem is solved by sampling the right-hand side of (26) at and , and for all and then finding the maximum of the resulting list of samples.
With condition A satisfied for some known constants, one may apply Theorem 2.1. However, as seen with Example 2.1, if is too large, Theorem 2.1 may still require larger than it actually is. It may be possible to decrease by repeating the maximization process with smaller or larger . If an acceptable is found, Theorem 2.1 guarantees stability. If not, one may need to train longer to find an acceptable .

No small isolated data clusters condition. Assume that the classifier is a deep neural network with the absolute value function as its activation function. By verifying the no small isolated data clusters condition, we are guaranteed that condition B holds for all time. Therefore, we need only verify once that the no small isolated data clusters condition holds, which is its principal advantage. To verify this condition, choose constants , , and . Let be the set of all slabs in such that and . We are left with finding the constant which satisfies (21) for all slabs in :
(27) 
As in verifying condition A, we will solve this minimization problem by discretizing the domain of minimization, , and sampling the objective function only on that discretization. It should be noted that the dimension of is . Since (the dimension of the space containing ) is typically high, this means that a sufficiently fine discretization is necessarily quite large.
After finding , we may be sure that condition B is satisfied at every iteration of the training algorithm for known constants. Therefore, we may apply Theorem 2.2 to guarantee stability.
3 Proof of Theorems 2.1, 2.2 and 2.3
3.1 Elementary estimates
Here, we will show the details and derivations of several simple equations and inequalities, together with some technical lemmas.
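Estimates for loss. As mentioned in Section 2.3.1, the quantity $\delta X$ facilitates convenient estimates for loss. To start, (6) gives
(28) $p_{i(s)}(s,\alpha) = \dfrac{e^{X_{i(s)}(s,\alpha)}}{\sum_{k=1}^K e^{X_k(s,\alpha)}} = \dfrac{1}{1+\sum_{k\neq i(s)} e^{X_k(s,\alpha)-X_{i(s)}(s,\alpha)}}$
For each $k\neq i(s)$, $X_k(s,\alpha)-X_{i(s)}(s,\alpha) \leq -\delta X(s,\alpha)$, with equality for at least one such $k$. Using (7) and (28), we obtain estimates on $p_{i(s)}(s,\alpha)$:
(29) $\dfrac{1}{1+(K-1)e^{-\delta X(s,\alpha)}} \leq p_{i(s)}(s,\alpha) \leq \dfrac{1}{1+e^{-\delta X(s,\alpha)}}$
Finally, (29) gives estimates on loss:
(30) $\sum_{s\in T}\mu(s)\log\left(1+e^{-\delta X(s,\alpha)}\right) \leq \bar L(\alpha) \leq \sum_{s\in T}\mu(s)\log\left(1+(K-1)e^{-\delta X(s,\alpha)}\right)$
In particular, if there are only two classes, the inequalities in (29) and (30) are equalities.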
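The two-sided estimate on the correct-class probability can be sanity-checked numerically. The sketch below assumes the estimate takes the form 1/(1 + (K-1)e^{-d}) <= p <= 1/(1 + e^{-d}), where d is the gap between the correct-class output and the largest competing output and K is the number of classes; the random outputs `X` and `labels` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 4, 1000
X = rng.normal(scale=3.0, size=(n, K))   # hypothetical pre-softmax outputs
labels = rng.integers(0, K, size=n)      # hypothetical correct classes
rows = np.arange(n)

x_true = X[rows, labels]
masked = X.copy()
masked[rows, labels] = -np.inf           # exclude the correct class from the max
delta = x_true - masked.max(axis=1)      # gap for every object

p_true = np.exp(x_true) / np.exp(X).sum(axis=1)
lower = 1.0 / (1.0 + (K - 1) * np.exp(-delta))
upper = 1.0 / (1.0 + np.exp(-delta))

print(bool(np.all(lower <= p_true + 1e-12)), bool(np.all(p_true <= upper + 1e-12)))
```

Taking minus the logarithm and the weighted sum over the training set turns these pointwise bounds into bounds on the loss; with K = 2 the lower and upper expressions coincide, which matches the remark about equality for two classes.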
Derivations of (12) and (13). Equations (12) and (13) show how low loss leads to high accuracy. Starting from the lower bound for loss in (30), we make the following estimates for any :
(31) 
Observe that for . It follows that
With the assumption that loss is decreasing, and , we conclude that
On the other hand, we can obtain an improved estimate for by applying (31) with :
Derivation of (14). Equation (14) shows that when is large, the sum (8) is dominated by a few terms that correspond to poorly classified objects. To derive (14), start from the upper bound in (30), and then make the following series of estimates:
The following two technical lemmas will be used in later proofs.
Lemma 3.1.
Suppose . The inequality
can always be satisfied for some
The proof of Lemma 3.1 is essentially a long series of elementary estimates which are not very enlightening. Consequently, it is relegated to the appendix.
Lemma 3.2.
For any and any ,
(32) 
Proof.
Observe that for , one trivially has that
so that