On the Inductive Bias of Dropout
Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager et al. We focus on linear classification where a convex proxy to the misclassification loss (i.e., the logistic loss used in logistic regression) is minimized. We show:
when the dropout-regularized criterion has a unique minimizer,
when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded,
that the dropout regularization can be non-monotonic as individual weights increase from 0, and
that the dropout regularization penalty may not be convex.
This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function.
In order to contrast dropout regularization with regularization, we formalize the notion of when different sources are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and regularization differ. We provide some similar results for regularization.
Since its prominent role in a win of the ImageNet Large Scale Visual Recognition Challenge (Hinton, 2012; Hinton et al., 2012), there has been intense interest in dropout (see the work by Dahl (2012); L. Deng (2013); Dahl et al. (2013); Wan et al. (2013); Wager et al. (2013); Baldi and Sadowski (2013); Van Erven et al. (2014)). This paper studies the inductive bias of dropout: when one chooses to train with dropout, what prior preference over models results? We show that dropout training shapes the learner’s search space in a much different way than or regularization. Our results provide new insight into why dropout prefers rare features, how the dropout probability affects the strength of regularization, and how dropout restricts the co-adaptation of weights.
Our theoretical study will concern learning a linear classifier via convex optimization. The learner wishes to find a parameter vector so that, for a random feature-label pair drawn from some joint distribution , the probability that is small. It does this by using training data to try to minimize , where is the loss function associated with logistic regression.
We have chosen to focus on this problem for several reasons. First, the inductive bias of dropout is not well understood even in this simple setting. Second, linear classifiers remain a popular choice for practical problems, especially in the case of very high-dimensional data. Third, we view a thorough understanding of dropout in this setting as a mandatory prerequisite to understanding the inductive bias of dropout when applied in a deep learning architecture. This is especially true when the preference over deep learning models is decomposed into preferences at each node. In any case, the setting that we are studying faithfully describes the inductive bias of a deep learning system at its output nodes.
We will borrow the following clean and illuminating description of dropout as artificial noise due to Wager et al. (2013). An algorithm for linear classification using loss and dropout updates its parameter vector online, using stochastic gradient descent. Given an example , the dropout algorithm independently perturbs each feature of : with probability , is replaced with , and, with probability , is replaced with . Equivalently, is replaced by , where
before performing the stochastic gradient update step. (Note that, while obviously depends on , if we sample the components of independently of one another and , by choosing with the dropout probability , then we may write .)
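The perturbation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the usual unbiased form of dropout, in which a feature is zeroed with probability `q` and otherwise rescaled by `1/(1-q)`, so that the perturbed vector equals the original in expectation. The function names and the toy vector are hypothetical.

```python
import random


def dropout_perturb(x, q, rng):
    """Return a dropout-perturbed copy of the feature vector x.

    Assumption: each feature is independently zeroed with probability q
    and otherwise rescaled by 1/(1-q), so the result is unbiased for x.
    """
    p = 1.0 - q
    return [xi / p if rng.random() < p else 0.0 for xi in x]


def additive_noise(x, x_tilde):
    """Noise vector nu such that x_tilde = x + nu, componentwise."""
    return [xt - xi for xi, xt in zip(x, x_tilde)]


rng = random.Random(0)          # fixed seed so the run is reproducible
x = [1.0, -2.0, 0.5]            # hypothetical toy feature vector
q = 0.5                         # dropout probability

samples = [dropout_perturb(x, q, rng) for _ in range(20000)]
mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(x))]
nu = additive_noise(x, samples[0])

print("empirical mean of perturbed vectors:", mean)
print("one noise vector:", nu)
```

The empirical mean of the perturbed vectors should be close to `x`, which is the property that makes the additive-noise view convenient.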
Stochastic gradient descent is known to converge under a broad variety of conditions (Kushner and Yin, 1997). Thus, if we abstract away sampling issues as done by Breiman (2004); Zhang (2004); Bartlett et al. (2006); Long and Servedio (2010), we are led to consider
as dropout can be viewed as a stochastic gradient update of this global objective function. We call this objective the dropout criterion, and it can be viewed as a risk on the dropout-induced distribution. (Abstracting away sampling issues is consistent with our goal of concentrating on the inductive bias of the algorithm. From the point of view of a bias-variance decomposition, we do not intend to focus on the large-sample-size case, where the variance is small, but rather to focus on the contribution from the bias where could be an empirical sample distribution.)
We start with the observation of Wager et al. (2013) that the dropout criterion may be decomposed as
where is non-negative, and depends only on the marginal distribution over the feature vectors (along with the dropout probability ), and not on the labels. This leads naturally to a view of dropout as a regularizer.
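For a small finite source, the decomposition can be checked exactly by enumerating all dropout masks. The sketch below is illustrative only. It assumes the natural-log logistic loss `ln(1 + exp(-z))`, labels in {-1, +1}, and the `1/(1-q)` rescaling of kept features. The toy distribution `dist` and all function names are hypothetical.

```python
import itertools
import math


def logistic_loss(z):
    """Numerically stable ln(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def expected_loss(w, dist):
    """Expected logistic loss; dist is a list of (probability, x, y)."""
    return sum(pr * logistic_loss(y * dot(w, x)) for pr, x, y in dist)


def dropout_criterion(w, dist, q):
    """Exact expected loss under dropout, summing over every keep-mask."""
    p = 1.0 - q
    total = 0.0
    for pr, x, y in dist:
        for mask in itertools.product([0, 1], repeat=len(x)):
            kept = sum(mask)
            mask_prob = p ** kept * q ** (len(x) - kept)
            x_tilde = [m * xi / p for m, xi in zip(mask, x)]
            total += pr * mask_prob * logistic_loss(y * dot(w, x_tilde))
    return total


def dropout_regularizer(w, dist, q):
    """Dropout criterion minus the unperturbed expected loss."""
    return dropout_criterion(w, dist, q) - expected_loss(w, dist)


# Hypothetical two-example source over two features.
dist = [(0.5, (1.0, 1.0), 1), (0.5, (-1.0, 0.5), -1)]
flipped = [(pr, x, -y) for pr, x, y in dist]   # same features, labels negated
w, zero, q = (0.7, -0.3), (0.0, 0.0), 0.5

print("regularizer:", dropout_regularizer(w, dist, q))
print("regularizer with flipped labels:", dropout_regularizer(w, flipped, q))
print("criterion at the zero vector:", dropout_criterion(zero, dist, q))
```

Under these assumptions the penalty is unchanged when the labels are negated, which reflects the claim that it depends only on the marginal distribution over feature vectors. Enumeration costs 2^n mask evaluations per example, so this is only practical for a handful of features.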
A popular style of learning algorithm minimizes an objective function like the RHS of (1), but where is replaced by a norm of . One motivation for algorithms in this family is to first replace the training error with a convex proxy to make optimization tractable, and then to regularize using a convex penalty such as a norm, so that the objective function remains convex.
We show that formalizes a preference for classifiers that assign a very large weight to a single feature. This preference is stronger than what one gets from a penalty proportional to . In fact, we show that, despite the convexity of the dropout risk, is not convex, so that dropout provides a way to realize the inductive bias arising from a non-convex penalty, while still enjoying the benefit of convexity in the overall objective function (see the plots in Figures 1, 2 and 3). Figure 1 shows the even more surprising result that the dropout regularization penalty is not even monotonic in the absolute values of the individual weights.
It is not hard to see that . Thus, if is greater than the expected loss incurred by (which is ), then it might as well be infinity, because dropout will prefer to . However, in some cases, dropout never reaches this extreme – it remains willing to use a model, even if its parameter is very large, unlike methods that use a convex penalty. In particular,
for all , no matter how large gets; of course, the same is true for the other features. On the other hand, except for some special cases (which are detailed in the body of the paper),
goes to infinity with . It follows that cannot be approximated to within any factor, constant or otherwise, by a convex function of .
To get a sense of which sources dropout can be successfully applied to, we compare dropout with an algorithm that regularizes using , by minimizing the criterion:
We will use “” as a shorthand to refer to an algorithm that minimizes (2). Note that , the probability of dropping out an input feature, plays a role in dropout analogous to . In particular, as goes to zero the examples remain unperturbed and the dropout regularization has no effect.
Informally, we say that joint probability distributions and separate dropout from if, when the same parameters and are used for both and , then using dropout leads to a much more accurate hypothesis for , and using leads to a much more accurate hypothesis for . This enables us to illustrate the inductive biases of the algorithms through the use of contrasting sources that either align or are incompatible with the algorithms’ inductive bias. Comparing with another regularizer helps to restrict these illustrative examples to “reasonable” sources, which can be handled using another regularizer. Ensuring that the same values of the regularization parameter are used for both and controls for the amount of regularization, and ensures that the difference is due to the model preferences of the respective regularizers. This style of analysis is new, as far as we know, and may be a useful tool for studying the inductive biases of other algorithms and in other settings.
Related previous work. Our research builds on the work of Wager et al. (2013), who analyzed dropout for random pairs where the distribution of given comes from a member of the exponential family, and the quality of a model is evaluated using the log-loss. They pointed out that, in these cases, the dropout criterion can be decomposed into the original loss and a term that does not depend on , which therefore can be viewed as a regularizer. They then proposed an approximation to this dropout regularizer, discussed its relationship with other regularizers and training algorithms, and evaluated it experimentally. Baldi and Sadowski (2013) exposed properties of dropout when viewed as an ensemble method (see also Bachman et al. (2014)). Van Erven et al. (2014) showed that applying dropout for online learning in the experts setting leads to algorithms that adapt to important properties of the input without requiring doubling or other parameter-tuning techniques, and Abernethy et al. (2014) analyzed a class of methods including dropout by viewing these methods as smoothers. The impact of dropout on generalization (roughly, how much dropout restricts the search space of the learner, or, from a bias-variance point of view, its impact on variance) was studied by Wan et al. (2013) and Wager et al. (2014). The latter paper considers a variant of dropout compatible with a Poisson source, and shows that under some assumptions this dropout variant converges more quickly to its infinite sample limit than non-dropout training, and that the Bayes-optimal predictions are preserved under the modified dropout distribution. Our results complement theirs by focusing on the effect of the original dropout on the algorithm’s bias.
Section 2 defines our notation and characterizes when the dropout criterion has a unique minimizer. Section 3 presents many additional properties of the dropout regularizer. Section 4 formally defines when two distributions separate two algorithms or regularizers. Sections 5 and 6 give sources over that separate dropout and . Section 7 provides plots demonstrating that the same distributions separate dropout from regularization. Sections 8 and 9 give separation results from with many features.
We use for the optimizer of the dropout criterion, for the probability that a feature is dropped out, and for the probability that a feature is kept throughout the paper. As in the introduction, if and is a joint distribution over , define
where for sampled independently at random from with , and is the logistic loss function:
For some analyses, an alternative representation of will be easier to work with. Let be sampled randomly from , independently of and one another, with . Defining , we have the equivalent definition
To see that they are equivalent, note that
Although this paper focuses on the logistic loss, the above definitions can be used for any loss function . Since the dropout criterion is an expectation of , we have the following obvious consequence.
If loss is convex, then the dropout criterion is also a convex function of .
Similarly, we use for the optimizer of the regularized criterion:
It is not hard to see that the term implies that is always well-defined. On the other hand, is not always well-defined, as can be seen by considering any distribution concentrated on a single example. This motivates the following definition.
Let be a joint distribution with support contained in . A feature is perfect modulo ties for if either for all in the support of , or for all in the support of .
Put another way, is perfect modulo ties if there is a linear classifier that only pays attention to feature and is perfect on the part of where is nonzero.
For all finite domains , all distributions with support in , and all , we have that has a unique minimum in if and only if no feature is perfect modulo ties for .
Proof: Assume for contradiction that feature is perfect modulo ties for and some is the unique minimizer of . Assume w.l.o.g. that for all in the support of (the case where is analogous). Increasing keeps the loss unchanged on examples where and decreases the loss on the other examples in the support of , contradicting the assumption that was a unique minimizer of the expected loss.
Now suppose that each feature has both examples where and examples where in the support of . Since the support of is finite, there is a positive lower bound on the probability of any example in the support. With probability , component of random vector is non-zero and the remaining components are all zero. Therefore as increases without bound in the positive or negative direction, also increases without bound. Since , there is a value depending only on distribution and the dropout probability such that minimizing over is equivalent to minimizing over . Since for all , has full rank and therefore is strictly convex. Since a strictly convex function defined on a compact set has a unique minimum, has a unique minimum on , and therefore on .
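The condition in the theorem is easy to test on a finite support. The sketch below is a hedged illustration. It assumes that "perfect modulo ties" means the product of the label and the feature value never changes sign across the support (zero values count as ties), which matches the informal description above. The function names and toy supports are hypothetical.

```python
def is_perfect_modulo_ties(support, i):
    """True if feature i never disagrees in sign with the label.

    support: list of (x, y) pairs with y in {-1, +1}.
    Assumption: feature i is perfect modulo ties when y * x[i] is
    non-negative on the whole support, or non-positive on the whole support.
    """
    products = [y * x[i] for x, y in support]
    return all(v >= 0 for v in products) or all(v <= 0 for v in products)


def criterion_has_unique_minimizer(support):
    """Apply the theorem's test: unique minimum iff no feature is perfect."""
    n = len(support[0][0])
    return not any(is_perfect_modulo_ties(support, i) for i in range(n))


# A source concentrated on a single example: every feature is perfect.
single_example = [((1.0, 0.0), 1)]

# Same feature vector seen with both labels: no feature is perfect.
noisy_labels = [((1.0, 1.0), 1), ((1.0, 1.0), -1)]

print(criterion_has_unique_minimizer(single_example))
print(criterion_has_unique_minimizer(noisy_labels))
```

Note that an always-zero feature counts as perfect modulo ties under this reading, which is consistent with its weight being unconstrained by the criterion.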
See Table 1 for a summary of the notation used in the paper.
3 Properties of the Dropout Regularizer
We start by rederiving the regularization function corresponding to dropout training previously presented in Wager et al. (2013), specialized to our context and using our notation. The first step is to write in an alternative way that exposes some symmetries:
This then implies
Since , we get the following.
(Wager et al., 2013)
Using a Taylor expansion, Wager et al. (2013) arrived at the following approximation:
This approximation suggests two properties: that the strength of the regularization penalty decreases exponentially in the prediction confidence , and that the regularization penalty goes to infinity as the dropout probability goes to 1. However, can be quite large, making a second-order Taylor expansion inaccurate. (Footnote 1: Wager et al. (2013) experimentally evaluated the accuracy of a related approximation in the case that, instead of using dropout, was distributed according to a zero-mean Gaussian.) In fact, the analysis in this section suggests that the regularization penalty does not decrease with the confidence and that the regularization penalty increases linearly with (Figure 1, Theorem 8, Proposition 9).
The following propositions show that satisfies at least some of the intuitive properties of a regularizer.
Proof: The proposition follows from Jensen’s Inequality.
The vector learned by dropout training minimizes . However, the vector has and , implying:
Thus any regularization penalty greater than is effectively equivalent to a regularization penalty of .
We now present new results based on analyzing the exact . The next properties show that the dropout regularizer is emphatically not like other convex or norm-based regularization penalties in that the dropout regularization penalty always remains bounded when a single component of the weight vector goes to infinity (see also Figure 1).
For all dropout probabilities , all , all marginal distributions over -feature vectors, and all indices ,
Proof: Fix arbitrary , , , and . We have
Fix an arbitrary in the support of and examine the expectation over for that . Recall that is 0 with probability and is with probability , and we will use the substitution .
We now consider cases based on whether or not is 0. When (so either or is ) then (10) is also 0.
If then consider the derivative of (10) w.r.t. , which is
This derivative is positive since and . Therefore (10) is bounded by its limit as , which is , in this case.
Under the conditions of Theorem 8,
Note that this bound on the regularization penalty depends on neither the range nor the expectation of . In particular, it has a far different character than the approximation of Equation (8).
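The boundedness can be seen numerically for a single example with a single active weight. The following sketch is an illustration under stated assumptions, not a restatement of the theorem: it assumes the natural-log logistic loss and the `1/(1-q)` rescaling, and writes the label-independent penalty with the softplus function `ln(1 + exp(z))`. Under these assumptions the penalty for one example works out to `q*ln(2) + p*softplus(z/p) - softplus(z)` with `z = w_i * x_i`, which increases in `|z|` and approaches `q*ln(2)`. The function names are hypothetical.

```python
import math


def softplus(z):
    """Numerically stable ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def single_weight_penalty(w_i, x_i, q):
    """Dropout penalty of one example when only weight i is non-zero.

    With probability q the feature is dropped (prediction 0); otherwise it
    is kept and rescaled by 1/p. The unperturbed prediction is w_i * x_i.
    """
    p = 1.0 - q
    z = w_i * x_i
    return q * math.log(2.0) + p * softplus(z / p) - softplus(z)


q = 0.5
weights = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0, 500.0]
penalties = [single_weight_penalty(w_i, 1.0, q) for w_i in weights]
limit = q * math.log(2.0)       # value approached as the weight grows

for w_i, pen in zip(weights, penalties):
    print(f"w_i = {w_i:7.1f}   penalty = {pen:.6f}")
print(f"q * ln 2 = {limit:.6f}")
```

The contrast with a norm penalty is the point of the exercise: a squared-norm term evaluated at the same weights would grow without bound.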
In Theorem 8 the other weights are fixed at 0 as goes to infinity. An additional assumption implies that the regularization penalty remains bounded even when the other components are non-zero. Let be a weight vector such that for all in the support of and dropout noise vectors we have for some bound (this implies that also). Then
Under the conditions of Theorem 8, if the weight vector has the property that for each in the support of and all of its corresponding dropout noise vectors then
Proposition 10 shows that the regularization penalty starting from a non-zero initial weight vector remains bounded as any one of its components goes to infinity. On the other hand, unless is small, the bound will be larger than the dropout criterion for the zero vector. This is a natural consequence as the starting weight vector could already have a large regularization penalty.
The derivative of (10) in the proof of Theorem 8 implies that the dropout regularization penalty is monotonic in when the other weights are zero. Surprisingly, this does not hold in general. The dropout regularization penalty due to a single example (as in Proposition 6) can be written as
Therefore if increasing a weight makes the second logarithm increase faster than the expectation of the first, then the regularization penalty decreases even as the weight increases. This happens when the products tend to have the same sign. The regularization penalty as a function of for the single example , , and set to various values is plotted in Figure 1. (Footnote 2: Setting is in some sense without loss of generality as the prediction and dropout regularization values for any , pair are identical to the values for , when each .) This gives us the following.
Unlike p-norm regularizers, the dropout regularization penalty is not always monotonic in the individual weights.
In fact, the dropout regularization penalty can decrease as weights move up from 0.
Fix , , and an arbitrary . Let be the distribution concentrated on . Then locally decreases as increases from .
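A small numerical experiment makes the non-monotonicity concrete. The sketch below is illustrative only: the specific example `x = (1, 1)`, the dropout probability 0.5, and the fixed first weight of 1 are hypothetical choices, and the penalty is computed under the assumptions of natural-log logistic loss and `1/(1-q)` rescaling of kept features.

```python
import itertools
import math


def softplus(z):
    """Numerically stable ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def example_penalty(w, x, q):
    """Exact dropout penalty contributed by the single feature vector x."""
    p = 1.0 - q
    expected = 0.0
    for mask in itertools.product([0, 1], repeat=len(x)):
        kept = sum(mask)
        prob = p ** kept * q ** (len(x) - kept)
        z = sum(wi * m * xi / p for wi, m, xi in zip(w, mask, x))
        expected += prob * softplus(z)
    clean = sum(wi * xi for wi, xi in zip(w, x))
    return expected - softplus(clean)


x = (1.0, 1.0)      # hypothetical example with two positive features
q = 0.5             # dropout probability
w1 = 1.0            # first weight held fixed

second_weights = [0.0, 0.05, 0.1, 0.2]
penalties = [example_penalty((w1, w2), x, q) for w2 in second_weights]

for w2, pen in zip(second_weights, penalties):
    print(f"w2 = {w2:.2f}   penalty = {pen:.6f}")
```

In this setting the penalty shrinks as the second weight moves away from zero, because both products of weight and feature share a sign and the second feature partly covers for the first when it is dropped.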
We now turn to the dropout regularization’s behavior when two weights vary together. If any features are always zero then their weights can go to without affecting either the predictions or . Two linearly dependent features might as well be one feature. After ruling out degeneracies like these, we arrive at the following theorem, which is proved in Appendix B.
Fix an arbitrary distribution with support in , weight vector , and non-dropout probability . If there is an with positive probability under such that and are both non-zero and have different signs, then the regularization penalty goes to infinity as goes to .
The theorem can be straightforwardly generalized to the case ; except in degenerate cases, sending two weights to infinity together will lead to a regularization penalty approaching infinity.
Theorem 13 immediately leads to the following corollary.
For a distribution with support in , if there is an with positive probability under such that and , then there is a such that for any , the regularization penalty goes to infinity with .
For any with both components nonzero, there is a distribution over with bounded support such that the regularization penalty goes to infinity with .
Together Theorems 8 and 13 demonstrate that is not convex (see also Figure 1). In fact, cannot be approximated to within any factor by a convex function, even if a dependence on and is allowed. For example, Theorem 8 shows that, for all with bounded support, both and remain bounded as goes to infinity, whereas Theorem 13 shows that there is such a such that is unbounded as goes to infinity.
Theorem 13 relies on the products having different signs. The following shows that does remain bounded when multiple components of go to infinity if the corresponding features are compatible in the sense that the signs of are always in alignment.
Let be a weight vector and be a discrete distribution such that for each index and all in the support of . The limit of as goes to infinity is bounded by .
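The two regimes can be contrasted numerically by scaling a fixed weight direction. This sketch is a hedged illustration: the weight direction `(2, 1)`, the two single-example sources, and the dropout probability are hypothetical, and the penalty again assumes natural-log logistic loss with `1/(1-q)` rescaling of kept features.

```python
import itertools
import math


def softplus(z):
    """Numerically stable ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def example_penalty(w, x, q):
    """Exact dropout penalty contributed by the single feature vector x."""
    p = 1.0 - q
    expected = 0.0
    for mask in itertools.product([0, 1], repeat=len(x)):
        kept = sum(mask)
        prob = p ** kept * q ** (len(x) - kept)
        z = sum(wi * m * xi / p for wi, m, xi in zip(w, mask, x))
        expected += prob * softplus(z)
    clean = sum(wi * xi for wi, xi in zip(w, x))
    return expected - softplus(clean)


q = 0.5
direction = (2.0, 1.0)          # hypothetical weight direction
x_conflict = (1.0, -1.0)        # products w_i * x_i have different signs
x_aligned = (1.0, 1.0)          # products w_i * x_i have the same sign
scales = [1.0, 10.0, 100.0]

conflict = []
aligned = []
for c in scales:
    w = tuple(c * d for d in direction)
    conflict.append(example_penalty(w, x_conflict, q))
    aligned.append(example_penalty(w, x_aligned, q))

for c, pc, pa in zip(scales, conflict, aligned):
    print(f"scale = {c:6.1f}   conflicting signs: {pc:9.4f}   aligned signs: {pa:.4f}")
```

With conflicting signs the penalty grows roughly linearly in the scale, while with aligned signs it stays below ln 2 at every scale tried here.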
The bounds in the preceding theorems and propositions suggest several properties of the dropout regularizer. First, the factors indicate that the strength of regularization grows linearly with dropout probability . Second, the factors in several of the bounds suggest that weights for rare features are encouraged by being penalized less strongly than weights for frequent features. This preference for rare features is sometimes seen in algorithms like the Second-Order Perceptron (Cesa-Bianchi et al., 2002) and AdaGrad (Duchi et al., 2011). Wager et al. (2013) discussed the relationship between dropout and these algorithms, based on approximation (8). Empirical results indicate that dropout performs well in domains like document classification where rare features can have high discriminative value (Wang and Manning, 2013). The theorems of this section suggest that the exact dropout regularizer minimally penalizes the use of rare features. Finally, Theorem 13 suggests that dropout limits co-adaptation by strongly penalizing large weights if the products often have different signs. On the other hand, if the products usually have the same sign, then Proposition 12 indicates that dropout encourages increasing the smaller weights to help share the prediction responsibility. This intuition is reinforced by Figure 1, where the dropout penalty for two large weights is much less than that for a single large weight when the features are highly correlated.
4 A definition of separation
Now we turn to illustrating the inductive bias of dropout by contrasting it with regularization. For this, we will use a definition of separation between pairs of regularizers.
Each regularizer has a regularization parameter that governs how strongly it regularizes. If we want to describe qualitatively what is preferred by one regularizer over another, we need to control for the amount of regularization.
Let , and recall that and are the minimizers of the dropout and -regularized criteria respectively.
Say that sources and -separate and dropout if there exist and such that both and . Say that indexed families and strongly separate and dropout if pairs of distributions in the family -separate them for arbitrarily large . We provide strong separations, using both and larger .
5 A source preferred by
Consider the joint distribution defined as follows:
This distribution has weight vectors that classify examples perfectly (the green shaded region in Figure 2). For this distribution, optimizing an -regularized criterion leads to a perfect hypothesis, while the weight vectors optimizing the dropout criterion make prediction errors on one-third of the distribution.
The intuition behind this behavior for the distribution described in (12) is that weight vectors that are positive multiples of classify all of the data correctly. However, with dropout regularization the and data points encourage the second weight to be negative when the first component is dropped out. This negative push on the second weight is strong enough to prevent the minimizer of the dropout-regularized criterion from correctly classifying the data point. Figure 2 illustrates the loss, dropout regularization, and dropout and criterion for this data source.
If , then for the distribution defined in (12).
In contrast, the minimizing the dropout criterion (3) has error rate at least .
If then for the distribution defined in (12).
6 A source preferred by dropout
In this section, consider the joint distribution defined by
The intuition behind this distribution is that the data point encourages a large weight on the first feature. This means that the negative pressure on the second weight due to the data point is much smaller (especially given its lower probability) than the positive pressure on the second weight due to the example. The regularized criterion emphasizes short vectors, and prevents the first weight from growing large enough (relative to the second weight) to correctly classify the data point. On the other hand, the first feature is nearly perfect; it only has the wrong sign on the second example where it is . This means that, in light of Theorem 8 and Proposition 10, dropout will be much more willing to use a large weight for , giving it an advantage for this source over . The plots in Figure 3 illustrate this intuition.
If , then for the distribution defined in (13).
In contrast, the minimizer of the dropout criterion is able to generalize perfectly.
If , then for the distribution defined in (13).
The results in this and the previous section show that the distributions defined in (12) and (13) strongly separate dropout and regularization. Theorem 19 shows that for the distribution analyzed in this section for all while Theorem 18 shows that for the same distribution whenever . In contrast, when is the distribution defined in the previous section, Theorem 16 shows whenever . For this same distribution , Theorem 17 shows that whenever .
In this section, we show that the same and distributions that separate dropout from regularization also separate dropout from regularization: the algorithm that minimizes
As in Sections 5 and 6, we set . Figure 4 plots the criterion (14) for the distributions defined in (12) and defined in (13). Like regularization, regularization produces a Bayes-optimal classifier on , but not on . Therefore the same argument shows that these distributions also strongly separate dropout and regularization.
8 A high-dimensional source preferred by
In this section we exhibit a source where regularization leads to a perfect predictor while dropout regularization creates a predictor with a constant error rate.
Consider the source defined as follows. The number of features is even. All examples are labeled . A random example is drawn as follows: the first feature takes the value with probability and otherwise, and a subset of exactly of the remaining features (chosen uniformly at random) takes the value , and the remaining of those first features take the value .
A majority vote over the last features achieves perfect prediction accuracy. This is despite the first feature (which does not participate in the vote) being more strongly correlated with the label than any of the voters in the optimal ensemble. Dropout, with its bias for single good features and discrimination against multiple disagreeing features, puts too much weight on this first feature. In contrast, regularization leads to the Bayes optimal classifier by placing less weight on the first feature than on any of the others.
If then the weight vector optimizing the criterion has perfect prediction accuracy: .
When , dropout with fails to find the Bayes optimal hypothesis. In particular, we have the following theorem.
If the dropout probability and the number of features is an even then the weight vector optimizing the dropout criterion has prediction error rate .
We conjecture that dropout fails on for all . As evidence, we analyze the case.
If dropout probability and the number of features is then the minimizer of the dropout criterion has prediction error rate .
9 A high-dimensional source preferred by dropout
Define the source , which depends on (small) positive real parameters , , and , as follows. A random label is generated first, with both of and equally likely. The features are conditionally independent given . The first feature tends to be accurate but small: with probability , and is with probability . The remaining features are larger but less accurate: for , feature is with probability , and otherwise.
When is small enough relative to , the Bayes-optimal prediction is to predict with the first feature. When is small, this requires concentrating the weight on to outvote the other features. Dropout is capable of making this one weight large while regularization is not.
If , , , , and , then
If , , , and is a large enough even number, then for any ,
Let be a large enough even number in the sense of Theorem 24. Let be the distribution defined at the start of Section 9 with number of features , , , and is a free parameter. Theorem 23 shows that when dropout probability . For this same distribution, Theorem 24 shows when . Therefore
goes to 0 as .
We have built on the interpretation of dropout as a regularizer in Wager et al. (2013) to prove several interesting properties of the dropout regularizer. This interpretation decomposes the dropout criterion minimized by training into a loss term plus a regularization penalty that depends on the feature vectors in the training set (but not the labels). We started with a characterization of when the dropout criterion has a unique minimum, and then turned to properties of the dropout regularization penalty. We verified that the dropout regularization penalty has some desirable properties of a regularizer: it is 0 at the zero vector, and the contribution of each feature vector in the training set is non-negative.
On the other hand, the dropout regularization penalty does not behave like standard regularizers. In particular, we have shown:
Although the dropout “loss plus regularization penalty” criterion is convex in the weights , the regularization penalty imposed by dropout training is not convex.
Starting from an arbitrary weight vector, any single weight can go to infinity while the dropout regularization penalty remains bounded.
In some cases, multiple weights can simultaneously go to infinity while the regularization penalty remains bounded.
The regularization penalty can decrease as weights increase from 0 when the features are correlated.
These are in stark contrast to standard norm-based regularizers that always diverge as any weight goes to infinity, and are non-decreasing in each individual weight.
In most cases the dropout regularization penalty does diverge as multiple weights go to infinity. We characterize when sending two weights to infinity causes the dropout regularization penalty to diverge, and when it will remain finite. In particular, dropout is willing to put large weights on multiple features if the products tend to have the same sign.
The form of our analytical bounds suggests that the strength of the regularizer grows linearly with the dropout probability , and provides additional support for the claim (Wager et al., 2013) that dropout favors rare features.
We found it important to check our intuition by working through small examples. To make this more rigorous we needed a definition of when a source favored dropout regularization over a more standard regularizer like . Such a definition needs to deal with the strength of regularization, a difficulty complicated by the fact that dropout regularization is parameterized by the dropout probability while regularization is parameterized by . Our solution is to consider pairs of sources and . We then say the pair separates the dropout and if dropout with a particular parameter performs better than