Regularized deep learning with non-convex penalties

Sujit Vettam¹, Majnu John²
The University of Chicago Booth School of Business,
Chicago, IL.
Department of Mathematics,
Hofstra University,
Hempstead, NY.
Center for Psychiatric Neuroscience,
Feinstein Institute of Medical Research,
Manhasset, NY.
Division of Psychiatry Research, Zucker Hillside,
Northwell Health System,
Glen Oaks, NY.
¹ Corresponding author, e-mail: sjv@chicagobooth.edu, svettam@uchicago.edu
² Corresponding author, address: Community Drive, Manhasset, NY 11030. e-mail: mjohn5@northwell.edu, majnu.john@hofstra.edu, Phone: +01 718 470 8221, Fax: +01 718 343 1659
Abstract

Regularization methods are often employed in deep learning neural networks (DNNs) to prevent overfitting. For penalty-based DNN regularization, typically only convex penalties are considered because of their optimization guarantees. Recent theoretical work has shown that non-convex penalties that satisfy certain regularity conditions are also guaranteed to perform well with standard optimization algorithms. In this paper, we examine new and existing non-convex penalties for DNN regularization. We provide theoretical justifications for the new penalties and also assess the performance of all penalties in DNN analyses of real datasets.

keywords:
deep learning, neural network, regularization, lasso, non-convex penalty
journal: arXiv

1 Introduction

The success of DNNs in learning complex relationships between inputs and outputs may be mainly attributed to multiple non-linear hidden layers [1,2]. As a consequence of having multiple layers, DNNs typically have tens of thousands of parameters, sometimes even millions. Such a large number of parameters gives the method an incredible amount of flexibility. On the downside, however, this may lead to overfitting the data, especially if the training sample is not large enough. Overfitting means that the method may work well on the training set but not on the test set. Since overfitting is a typical problem for DNNs, many methods have been suggested to reduce it. Adding weight penalties to the cost function, dropout, early stopping, max-norm regularization and data augmentation are some of the popular regularization methods used to avoid overfitting. In this paper, we narrow our focus to regularization methods based on weight penalties appended to the cost function.

The two most commonly considered penalties for DNN regularization are the $L_1$ and $L_2$ penalties. In the statistical literature, these two penalties are known as the Lasso [3] and Ridge [4,5] penalties, respectively. One of the main advantages of working with these two penalties is the convexity of the optimization problem, which guarantees that a local optimum will always be a global optimum. $L_1$ penalization is also a selection procedure, as it sets many parameters to zero; $L_2$ penalization does not have this property. All the parameters after $L_2$ penalization and all non-zero parameters after $L_1$ penalization are shrunk towards zero. The resulting bias in the regularized solution of the above convex penalties has motivated a few authors to consider nonconvex penalties [6,7], which have the potential to yield nearly unbiased estimates for the parameters. Recent theoretical work [8,9] has also shown that although nonconvex regularizers may yield multiple local optima, these are essentially as good as a global optimum from a statistical perspective.

In this paper we present nonconvex penalty functions which could be utilized as regularizers of the parameters in a DNN. In the methods section we motivate the definition of these penalty functions based on the $L_0$ norm. The main focus of our paper is to compare the performance of DNN regularization based on nonconvex penalties with regularization based on convex penalty functions. We provide theoretical justifications for our proposed regularization approaches and also assess their performance on real datasets.

The paper is structured as follows. In section 2, we motivate and introduce our method for regularizing DNNs, and justify it based on theoretical considerations. In section 3, we apply our method to real datasets and compare its performance with $L_1$ and $L_2$ regularization. Finally, we present our conclusions in section 4.

2 Methods

2.1 Background and Motivation

Consider a classifier $f(x; w)$ parameterized by the weight vector $w = (w_1, \ldots, w_m)$, for input $x$ and categorical output $y$. Optimal weights in a non-regularized setting are obtained by minimizing a cost function $L(w)$. Typically the negative log-likelihood is taken as the cost function; in the case of a categorical output it will be the cross-entropy function. One general approach for regularizing DNNs is to append a penalty function $P_{\theta}(w)$ to the cost function, where $\theta$ denotes the vector of tuning parameters associated with the penalty function. As done in most of the literature, we restrict our attention to coordinate-separable penalty functions, which can be expressed as a sum

$$P_{\theta}(w) = \sum_{j=1}^{m} p_{\theta}(w_j).$$

Thus the regularized optimization problem that we are interested in is

$$\hat{w} = \arg\min_{w} \Big\{ L(w) + \sum_{j=1}^{m} p_{\theta}(w_j) \Big\}. \tag{1}$$

The most commonly discussed approach, known as the ‘canonical selection procedure’, is based on

$$p_{\lambda}(w_j) = \lambda\, \mathbb{1}(w_j \neq 0); \tag{2}$$

the penalty function in this case is referred to as the $L_0$ norm. The key ideas behind Akaike’s, Bayesian and Minimax Risk based Information Criteria (AIC, BIC and RIC), and Mallows’ $C_p$ are all based on the above $L_0$ norm. However, it is intractable for DNN applications because finding the minimum of the objective function in (1) with the penalty function in (2) is in general NP-hard. It is combinatorial in nature and has exponential complexity, as it requires an exhaustive search of order $2^m$, where $m$ is the number of weights.

The above-mentioned intractability has led to considerations of approximations for the penalty function in (2). The most widely considered approximations are of the class of Bridge functions [10,11]

$$p_{\lambda}(w_j) = \lambda\, |w_j|^{q}, \quad q > 0,$$

motivated by the fact that

$$\lim_{q \to 0^{+}} |w_j|^{q} = \mathbb{1}(w_j \neq 0).$$

The $q = 1$ and $q = 2$ cases ($L_1$ and $L_2$ penalties) are known in the literature as Lasso and Ridge penalties. Note that the penalty function in (2) is singular at zero and the optimization problem based on it is non-convex. Bridge penalty functions are convex when $q \geq 1$ and non-convex for $0 < q < 1$. Bridge functions are singular at zero only in the case $q \leq 1$. Thus Lasso is the only case among the class of Bridge functions which is both convex and has a singularity at the origin. Convex relaxation of a non-convex problem has its advantage in the optimization setting based on the simple fact that a local minimum of a convex function is also a global minimum. Singularity at the origin for the penalty function is essentially what guarantees the sparsity of the solution (i.e. setting small estimated weights to zero to reduce model complexity).

Although Lasso has the above-mentioned advantages over other Bridge estimators, it differs from the $L_0$ norm in a crucial aspect: whereas the $L_0$ norm is constant for any nonzero argument, the $L_1$ norm increases linearly with the absolute value of the argument. This linear increase results in a bias for the $L_1$-regularized solution [6], which in turn could lead to modeling bias. As mentioned in [6], in addition to unbiasedness and sparsity, a good penalty function should result in an estimator with the continuity property. Continuity is necessary to avoid instability in model prediction. Note that the penalty function in (2) does not satisfy the continuity criterion. None of the Bridge penalty functions simultaneously satisfies all three of the preceding required properties. The solution for Bridge penalties is continuous only when $q \geq 1$. However, when $q > 1$ the Bridge penalties do not produce sparse solutions. When $q = 1$ (i.e. Lasso) the penalty produces a continuous and sparse solution, but this comes at the price of shifting the resulting estimator by a constant (i.e. bias).

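The above issues for the Bridge functions have led to considerations of other approximations for the penalty function in (2) (especially non-convex approximations) with the hope that these new approximations will satisfy (or nearly satisfy) all three desirable properties mentioned above. In this paper we present two non-convex approximation functions:

$$p_{\lambda,\varepsilon}(w) = \lambda \left(1 - \varepsilon^{|w|}\right), \quad \varepsilon \in (0,1),\ \lambda > 0, \tag{3}$$

$$p_{\lambda,\gamma}(w) = \lambda\, \frac{2}{\pi} \arctan(\gamma |w|), \quad \gamma > 0,\ \lambda > 0. \tag{4}$$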

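The first penalty has appeared previously in the medical imaging literature (under a slightly different guise) in a method for magnetic resonance image reconstruction [12], and it has been referred to as the Laplace penalty function. See also [13]. The second penalty function, based on arctan, has not been considered in the literature so far to the best of our knowledge. Two other non-convex penalties that currently exist in the literature are the SCAD penalty,

$$p_{\lambda,a}(w) = \begin{cases} \lambda |w|, & |w| \leq \lambda, \\ \dfrac{2a\lambda |w| - w^2 - \lambda^2}{2(a-1)}, & \lambda < |w| \leq a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |w| > a\lambda, \end{cases} \quad a > 2,$$

developed by Fan and Li (2001) [6], and the MCP regularizer (Zhang 2010) [7],

$$p_{\lambda,b}(w) = \begin{cases} \lambda |w| - \dfrac{w^2}{2b}, & |w| \leq b\lambda, \\ \dfrac{b\lambda^2}{2}, & |w| > b\lambda, \end{cases} \quad b > 0.$$

There are two other non-convex penalties that have appeared in the literature previously and that we do not consider in this paper. We present these two penalties in a later section and provide reasons for not considering them.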

Although non-convex penalties are worth considering in DNN applications, they rarely get as much attention as the convex penalty functions. For example, textbooks such as [14] mention only $L_1$ and $L_2$ as regularization methods based on weight penalties. In this paper, we compare the performance of non-convex regularizers (Laplace, Arctan, SCAD and MCP) with the $L_1$ and $L_2$ regularizers for DNNs.
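As a rough illustration of the four non-convex penalties compared in this paper, the sketch below evaluates each one for a single weight. It assumes the Laplace penalty has the form $\lambda(1 - \varepsilon^{|w|})$, the arctan penalty the form $\lambda \frac{2}{\pi}\arctan(\gamma|w|)$, and uses the standard textbook piecewise forms of SCAD and MCP; the function names and the `total_penalty` helper are hypothetical and are not taken from the authors' code.

```python
import math


def laplace_penalty(w, lam, eps):
    """Assumed Laplace form: lam * (1 - eps**|w|), with 0 < eps < 1."""
    return lam * (1.0 - eps ** abs(w))


def arctan_penalty(w, lam, gamma):
    """Assumed arctan form: lam * (2/pi) * arctan(gamma * |w|), gamma > 0."""
    return lam * (2.0 / math.pi) * math.atan(gamma * abs(w))


def scad_penalty(w, lam, a=3.7):
    """Standard SCAD penalty (Fan and Li, 2001), with a > 2."""
    t = abs(w)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0


def mcp_penalty(w, lam, b=1.5):
    """Standard MCP penalty (Zhang, 2010), with b > 0."""
    t = abs(w)
    if t <= b * lam:
        return lam * t - t * t / (2.0 * b)
    return b * lam * lam / 2.0


def total_penalty(weights, penalty, **params):
    """Coordinate-separable penalty: sum of the scalar penalty over all weights."""
    return sum(penalty(w, **params) for w in weights)
```

All four functions are zero at $w = 0$ and level off for large $|w|$ (SCAD and MCP become exactly constant, Laplace and arctan approach $\lambda$), which is the behaviour that distinguishes them from the $L_1$ and $L_2$ penalties.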

2.2 Theoretical considerations

Properties of the SCAD and MCP penalties have been studied in the original papers in which they were presented. Below we present a few properties satisfied by the Laplace and arctan penalty functions. These properties will help us to apply theorems from the existing literature [6,8,9] that guarantee that any local optimum lies close to the target vector $w^{*}$. These properties are easy to see from plots, but we give proofs. We also present a statistical consistency result in the appendix by slightly adapting the theorems presented in [20] and [21].

Properties of the Laplace penalty function

We begin with a useful lemma.

Lemma 2.1.

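For $\varepsilon \in (0,1)$ and $x \geq 0$,

$$\varepsilon^{x} \leq \frac{1}{1 + x \log(1/\varepsilon)}. \tag{5}$$
Proof.

Let $\delta = \log(1/\varepsilon)$. Note that $\delta > 0$ based on the assumptions. Taking the logarithm on both sides of the inequality (5), we get $-x\delta \leq -\log(1 + x\delta)$. Multiplying by $-1$ on both sides and substituting $y = x\delta$, we get $y \geq \log(1 + y)$. But this follows from the inequality $\log(1 + y) \leq y$ for all $y > -1$ (in particular for $y \geq 0$) and the fact that $x\delta \geq 0$. ∎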

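We present a few properties satisfied by the penalty function $p_{\lambda,\varepsilon}(w) = \lambda(1 - \varepsilon^{|w|})$.

(P1) $p_{\lambda,\varepsilon}(0) = 0$ and $p_{\lambda,\varepsilon}$ is symmetric around zero. This is easily verified.

(P2) $p_{\lambda,\varepsilon}(w)$ is increasing for $w \geq 0$. It is easy to see that $p'_{\lambda,\varepsilon}(w) = \lambda \log(1/\varepsilon)\, \varepsilon^{w}$ is positive for $w \geq 0$ and $\varepsilon \in (0,1)$.

(P3) For $w > 0$, the function $p_{\lambda,\varepsilon}(w)/w$ is non-increasing in $w$. Since, for $w > 0$,

$$\frac{d}{dw}\left[\frac{p_{\lambda,\varepsilon}(w)}{w}\right] = \frac{\lambda \left[\varepsilon^{w}\left(1 + w \log(1/\varepsilon)\right) - 1\right]}{w^{2}},$$

it suffices to show that the numerator is $\leq 0$ for $w > 0$. But this follows from Lemma 2.1 above.

(P4) The function $p_{\lambda,\varepsilon}$ is differentiable for all $w \neq 0$ and subdifferentiable at $w = 0$, with $\lim_{w \to 0^{+}} p'_{\lambda,\varepsilon}(w) = \lambda \log(1/\varepsilon)$. It is easy to see that any point in the interval $[-\lambda \log(1/\varepsilon),\ \lambda \log(1/\varepsilon)]$ is a subgradient of $p_{\lambda,\varepsilon}$ at $0$.

(P5) There exists $\mu > 0$ such that $p_{\lambda,\varepsilon}(w) + \frac{\mu}{2} w^{2}$ is convex: $\mu = \lambda (\log \varepsilon)^{2}$ will work. $\mu$ is a measure of the severity of non-convexity of the penalty function.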

Since the penalty function satisfies the properties (P1) to (P5), we have that $\lambda L |w| - p_{\lambda,\varepsilon}(w)$, with $L = \log(1/\varepsilon)$, is everywhere differentiable. These properties also imply that $p_{\lambda,\varepsilon}$ is $\lambda L$-Lipschitz as a function of $w$ [8]. In particular, all subgradients and derivatives of $p_{\lambda,\varepsilon}$ are bounded in magnitude by $\lambda L$ [8]. We also see that for an empirical loss satisfying the restricted strong convexity condition and the conditions for $\lambda$ and sample size $n$ in Theorem 1 in [8], the squared $\ell_2$-error of the estimator grows proportionally with the number of nonzeros in the target parameter and with $\lambda^{2}$. One condition that is not satisfied by $p_{\lambda,\varepsilon}$ is

(P6) There exists $\tau \in (0, \infty)$ such that $p'_{\lambda,\varepsilon}(w) = 0$ for all $w \geq \tau\lambda$. It is clear that such a $\tau$ does not exist for our penalty function. However, we note that $p'_{\lambda,\varepsilon}(w)$ can be made arbitrarily close to zero for large $w$. In other words, the following property is satisfied.

(P6’) $\lim_{w \to \infty} p'_{\lambda,\varepsilon}(w) = 0$.

(P6) and (P6’) are related to unbiasedness as mentioned in Fan and Li (2001). (P6) guarantees unbiasedness (and (P6’) near-unbiasedness) when the true unknown parameter is large, to avoid unnecessary modeling bias.

The Laplace penalty function depends on two parameters, $\lambda$ and $\varepsilon$, whereas the Lasso and Ridge penalties depend on $\lambda$ alone. In our case, is there an optimal choice of $\varepsilon$ that depends on $\lambda$ alone? The following considerations based on Fan and Li shed some light on this. According to Fan and Li [6], a good penalty function should have the following two properties.

(P7) The minimum of the function $|w| + p'_{\lambda,\varepsilon}(|w|)$ is positive. This property guarantees sparsity, at least in the empirical loss case; that is, the resulting estimator is a thresholding rule.

(P8) The minimum of the function $|w| + p'_{\lambda,\varepsilon}(|w|)$ is attained at $0$. This property, at least in the empirical loss case, is related to the continuity of the resulting estimator. Continuity helps to avoid instability in model prediction.

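Consider the function $g(w) = |w| + p'_{\lambda,\varepsilon}(|w|)$, which is symmetric around zero, so that if a minimum is attained at $w_0$, then it is attained at $-w_0$ as well. This allows us to restrict our attention to the domain $[0, \infty)$. Note that in this domain $g(w) = w + \lambda \log(1/\varepsilon)\, \varepsilon^{w}$ and $g'(w) = 1 - \lambda (\log \varepsilon)^{2} \varepsilon^{w}$. We have $g'(w_0) = 0$ so that $\varepsilon^{w_0} = 1 / [\lambda (\log \varepsilon)^{2}]$ or

$$w_0 = \frac{\log\left(\lambda (\log \varepsilon)^{2}\right)}{\log(1/\varepsilon)}. \tag{6}$$

In order that $w_0 \geq 0$, we require that $\lambda (\log \varepsilon)^{2} \geq 1$; $p_{\lambda,\varepsilon}$ satisfies (P8) if

$$\lambda (\log \varepsilon)^{2} = 1, \quad \text{i.e.} \quad \varepsilon = e^{-1/\sqrt{\lambda}}. \tag{7}$$

For any non-negative $w_0$ given by eq. (6) (i.e. with any $\lambda (\log \varepsilon)^{2} \geq 1$),

$$g(w_0) = w_0 + \lambda \log(1/\varepsilon)\, \varepsilon^{w_0} = w_0 + \frac{1}{\log(1/\varepsilon)} > 0,$$

so, in particular, $g(0)$ corresponding to $w_0 = 0$ is positive. Thus for a given $\lambda$, choosing $\varepsilon$ based on eq. (7) will ensure that properties (P7) and (P8) are satisfied by the Laplace penalty function.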
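The choice of $\varepsilon$ for a given $\lambda$ can be checked numerically. The sketch below assumes the Laplace penalty has the form $\lambda(1 - \varepsilon^{|w|})$ and reads eq. (7) as the condition $\lambda(\log\varepsilon)^2 = 1$, i.e. $\varepsilon = e^{-1/\sqrt{\lambda}}$; both are assumptions of this example, and the helper names and the value $\lambda = 0.01$ are purely illustrative. It evaluates $g(w) = |w| + p'(|w|)$ on a grid and locates its minimum.

```python
import math


def epsilon_for_lambda(lam):
    """Solve lam * (log eps)**2 = 1 for eps in (0, 1) (assumed reading of eq. (7))."""
    return math.exp(-1.0 / math.sqrt(lam))


def g(w, lam, eps):
    """|w| + p'(|w|) for the assumed Laplace penalty p(w) = lam * (1 - eps**|w|)."""
    t = abs(w)
    return t + lam * math.log(1.0 / eps) * eps ** t


lam = 0.01  # illustrative value only
eps = epsilon_for_lambda(lam)

# Evaluate g on a grid over [0, 5] and find where the minimum occurs.
grid = [i / 1000.0 for i in range(5001)]
values = [g(w, lam, eps) for w in grid]
g_min = min(values)
w_min = grid[values.index(g_min)]
```

Under these assumptions the minimum is attained at $w = 0$ (continuity, P8) and its value $\sqrt{\lambda}$ is positive (sparsity, P7).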

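Theorem 1 in Fan and Li [6] provides the required conditions for the $\sqrt{n}$-consistency of the estimator in a maximum likelihood framework and generalized linear models setting. The main assumption on the penalty function is stated as the following property.

(P9) $\max_{j} \{ |p''_{\lambda_n}(|w^{*}_j|)| : w^{*}_j \neq 0 \} \to 0$ as $n \to \infty$. In our case, this property is satisfied because $|p''_{\lambda_n,\varepsilon}(w)| = \lambda_n (\log \varepsilon)^{2} \varepsilon^{|w|} \leq \lambda_n (\log \varepsilon)^{2}$ for $w \neq 0$, which tends to zero when $\lambda_n \to 0$.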

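Lemma 2.1 suggests considering another penalty function $\tilde{p}_{\lambda,\varepsilon}$, where

$$\tilde{p}_{\lambda,\varepsilon}(w) = \lambda\, \frac{|w| \log(1/\varepsilon)}{1 + |w| \log(1/\varepsilon)}.$$

This penalty function ($\tilde{p}_{\lambda,\varepsilon}$) is equivalent to Geman’s penalty function mentioned in [15]; also mentioned in [12,13]. Most of the properties listed above are satisfied by the penalty function $\tilde{p}_{\lambda,\varepsilon}$ as well. For example, for $w \geq 0$,

$$\tilde{p}'_{\lambda,\varepsilon}(w) = \frac{\lambda \log(1/\varepsilon)}{\left(1 + w \log(1/\varepsilon)\right)^{2}} > 0.$$

In order to check (P3) we consider

$$\frac{\tilde{p}_{\lambda,\varepsilon}(w)}{w} = \frac{\lambda \log(1/\varepsilon)}{1 + w \log(1/\varepsilon)}, \quad w > 0,$$

which is non-increasing in $w$. For $w \geq 0$, it is easy to check that

$$\tilde{p}''_{\lambda,\varepsilon}(w) = -\frac{2\lambda (\log \varepsilon)^{2}}{\left(1 + w \log(1/\varepsilon)\right)^{3}} \geq -2\lambda (\log \varepsilon)^{2}.$$

However, this suggests that the $\mu$ required for (P5) is $2\lambda (\log \varepsilon)^{2}$, which is twice that for $p_{\lambda,\varepsilon}$. That is, the non-convexity of $\tilde{p}_{\lambda,\varepsilon}$ is twice as severe as that of $p_{\lambda,\varepsilon}$. Also $\tilde{p}'_{\lambda,\varepsilon}(w)$ converges to zero (as does $p'_{\lambda,\varepsilon}(w)$) for large $w$, satisfying (P6’) for near-unbiasedness. However, since it can be shown that

$$\lim_{w \to \infty} \frac{p'_{\lambda,\varepsilon}(w)}{\tilde{p}'_{\lambda,\varepsilon}(w)} = \lim_{w \to \infty} \varepsilon^{w} \left(1 + w \log(1/\varepsilon)\right)^{2} = 0,$$

we see that the convergence for $p'_{\lambda,\varepsilon}$ is faster. Hence we do not consider the latter penalty function, $\tilde{p}_{\lambda,\varepsilon}$, in this paper.

We may also generalize our penalty function to $\lambda (1 - \varepsilon^{|w|^{q}})$ for $0 < q < 1$. However, the $\mu$ corresponding to this function will be $\infty$, making it more severely non-convex, similar to the Bridge penalty function when $q < 1$. Hence, in this paper we focus only on the $q = 1$ case. Further comparison with the Bridge penalty is given in the subsection below.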

Properties of the arctan penalty function

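Here we check properties for the arctan penalty, $p_{\lambda,\gamma}(w) = \lambda \frac{2}{\pi} \arctan(\gamma |w|)$.

Property P1 ($p_{\lambda,\gamma}(0) = 0$ and $p_{\lambda,\gamma}$ is symmetric around zero) is again easily verified.

(P2): For $w \geq 0$,

$$p'_{\lambda,\gamma}(w) = \frac{2\lambda\gamma}{\pi \left(1 + \gamma^{2} w^{2}\right)}$$

is positive for $\gamma > 0$. Hence $p_{\lambda,\gamma}$ is increasing for $w \geq 0$.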

We state as a lemma a well-known fact about arctan function.

Lemma 2.2.

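For $x \geq 0$,

$$\frac{x}{1 + x^{2}} \leq \arctan(x) \leq x. \tag{8}$$
Proof.

If we take $f(x) = x - \arctan(x)$, then $f'(x) = 1 - \frac{1}{1 + x^{2}} \geq 0$ for $x \geq 0$ and hence $f$ is non-decreasing in the interval $[0, \infty)$. In particular $f(x) \geq f(0) = 0$, which proves the right inequality. Similarly, by writing $h(x) = \arctan(x) - \frac{x}{1 + x^{2}}$, we have $h'(x) = \frac{2x^{2}}{(1 + x^{2})^{2}} \geq 0$ for $x \geq 0$, proving the left inequality. ∎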

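(P3) For $w > 0$ consider the function $r(w) = p_{\lambda,\gamma}(w)/w$.

$$r'(w) = \frac{2\lambda}{\pi w^{2}} \left[ \frac{\gamma w}{1 + \gamma^{2} w^{2}} - \arctan(\gamma w) \right].$$

Thus $r'(w) \leq 0$ by the above lemma and hence $r$ is non-increasing.

(P4) $\lim_{w \to 0^{+}} p'_{\lambda,\gamma}(w) = \frac{2\lambda\gamma}{\pi}$. Any point in the interval $\left[-\frac{2\lambda\gamma}{\pi},\ \frac{2\lambda\gamma}{\pi}\right]$ is a subgradient of $p_{\lambda,\gamma}$ at $0$.

(P5) Any $\mu \geq \frac{3\sqrt{3}}{4\pi} \lambda \gamma^{2}$ makes $p_{\lambda,\gamma}(w) + \frac{\mu}{2} w^{2}$ convex.

(P6’) It is easy to check that

$$\lim_{w \to \infty} p'_{\lambda,\gamma}(w) = \lim_{w \to \infty} \frac{2\lambda\gamma}{\pi \left(1 + \gamma^{2} w^{2}\right)} = 0.$$

It is also easy to see that $|p''_{\lambda_n,\gamma}(w)| = \frac{4 \lambda_n \gamma^{3} |w|}{\pi (1 + \gamma^{2} w^{2})^{2}} \leq \frac{3\sqrt{3}}{4\pi} \lambda_n \gamma^{2}$, which tends to zero when $\lambda_n \to 0$ and guarantees that (P9) is satisfied.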
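A quick numerical check of the arctan penalty's properties can make the statements above concrete. The sketch below assumes the penalty has the form $\lambda \frac{2}{\pi}\arctan(\gamma|w|)$; the parameter values ($\lambda = 0.5$, $\gamma = 100$) and the helper names are illustrative only. It checks that the derivative is positive (P2), that $p(w)/w$ is non-increasing (P3), and that the derivative decays towards zero for large $w$ (P6’).

```python
import math


def arctan_penalty(w, lam, gamma):
    """Assumed arctan form: lam * (2/pi) * arctan(gamma * |w|)."""
    return lam * (2.0 / math.pi) * math.atan(gamma * abs(w))


def arctan_penalty_grad(w, lam, gamma):
    """Derivative of the assumed arctan penalty for w > 0."""
    return lam * (2.0 / math.pi) * gamma / (1.0 + (gamma * w) ** 2)


lam, gamma = 0.5, 100.0                   # illustrative values only
ws = [10.0 ** k for k in range(-4, 5)]    # w from 1e-4 to 1e4

ratios = [arctan_penalty(w, lam, gamma) / w for w in ws]
grads = [arctan_penalty_grad(w, lam, gamma) for w in ws]

is_grad_positive = all(gr > 0.0 for gr in grads)                                 # (P2)
is_ratio_nonincreasing = all(r1 >= r2 for r1, r2 in zip(ratios, ratios[1:]))     # (P3)
grad_at_largest_w = grads[-1]                                                    # (P6')
```

The penalty itself stays below $\lambda$ for every finite $w$, so large weights are penalized almost equally, unlike under the $L_1$ penalty.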

Convergence of the Laplace and arctan approximation functions

Here we present heuristic justifications for using the Laplace or arctan penalties over the bridge penalties by considering their respective error in approximating the indicator function involved in the canonical selection procedure (eq. (2)).

Lemma 2.3.

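Consider the approximation functions $f_1(x) = |x|^{q}$ for $0 < q < 1$, and $f_2(x) = 1 - \varepsilon^{|x|}$ for some fixed $\varepsilon \in (0,1)$. The overall error for $f_1$ is much larger than that of $f_2$.

Proof.

We give a proof based on heuristic analysis. Because of symmetry we just focus on the right side of the origin on the x-axis for error analysis. For an interval $[M, M+1]$, with $M \geq 1$, $0 < q < 1$, we have $f_1(x) = x^{q}$, so that the area under the curve in this interval is

$$\int_{M}^{M+1} x^{q}\, dx = \frac{(M+1)^{q+1} - M^{q+1}}{q+1}.$$

The area under the curve for the indicator function in this interval is $1$, so that the error in approximation is

$$\frac{M^{q+1} \left[ (1 + 1/M)^{q+1} - 1 \right]}{q+1} - 1 \approx M^{q} - 1,$$

where we used the leading term of the Taylor series approximation to the function $(1 + 1/M)^{q+1}$. For the approximation function $f_2$ the area under the curve in the interval $[M, M+1]$ with $M \geq 1$ is

$$\int_{M}^{M+1} \left(1 - \varepsilon^{x}\right) dx = 1 - \frac{\varepsilon^{M} (\varepsilon - 1)}{\log \varepsilon}.$$

Using the leading term in the Taylor series approximation, $\log \varepsilon$ can be approximated by $\varepsilon - 1$, so that the absolute value of the error is approximately $\varepsilon^{M}$. Thus the absolute value of the error for $f_1$ in a unit interval (with $M \geq 1$) is approximately $M^{q} - 1$ and that for $f_2$ in the same interval is approximately $\varepsilon^{M}$. For fixed $q$ and $\varepsilon$, the former can be made arbitrarily large, and the latter arbitrarily small, by increasing $M$. The error for $f_2$ may be larger than that for $f_1$ in the interval $[0, 1]$, but the difference in this interval is bounded. ∎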

Lemma 2.4.

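Consider the approximation functions $f_1(x) = |x|^{q}$ for $0 < q < 1$, and $f_3(x) = \frac{2}{\pi} \arctan(\gamma |x|)$ where $\gamma > 0$ is fixed. The overall error for $f_1$ is much larger than that of $f_3$.

Proof.

In this case

$$\int_{M}^{M+1} \frac{2}{\pi} \arctan(\gamma x)\, dx \approx \frac{2}{\pi} \arctan(\gamma M),$$

where the approximate equality above was obtained using the leading terms in the Taylor series expansions of the functions appearing in the exact integral.

Thus the absolute value of the error for $f_3$ in a unit interval (with $M \geq 1$) is approximately

$$1 - \frac{2}{\pi} \arctan(\gamma M),$$

which can be made arbitrarily small by increasing $M$, since $\arctan(\gamma M) < \pi/2$ for $M < \infty$, and $\arctan(\gamma M)$ increases to $\pi/2$ as $M \to \infty$. On the other hand, as shown in the previous lemma, the corresponding error for $f_1$ can be made arbitrarily large by increasing $M$. Also the error for $f_1$ and $f_3$ in the interval $[0, 1]$ is bounded. ∎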
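The heuristic comparison in the two lemmas above can be reproduced numerically. The sketch below assumes the three approximating functions are $|x|^q$ (Bridge), $1 - \varepsilon^{|x|}$ (Laplace) and $\frac{2}{\pi}\arctan(\gamma|x|)$ (arctan), and measures how far the area under each curve on $[M, M+1]$ is from the area under the indicator function, which equals 1. The parameter values ($q = 0.5$, $\varepsilon = 0.5$, $\gamma = 1$) and the midpoint-rule integrator are choices made for this illustration only.

```python
import math


def bridge(x, q=0.5):
    return abs(x) ** q


def laplace(x, eps=0.5):
    return 1.0 - eps ** abs(x)


def arctan(x, gamma=1.0):
    return (2.0 / math.pi) * math.atan(gamma * abs(x))


def unit_interval_error(f, M, n=1000):
    """|area under f on [M, M+1] minus 1|, using the midpoint rule with n cells."""
    h = 1.0 / n
    area = sum(f(M + (i + 0.5) * h) for i in range(n)) * h
    return abs(area - 1.0)


errors = {
    M: {
        "bridge": unit_interval_error(bridge, M),
        "laplace": unit_interval_error(laplace, M),
        "arctan": unit_interval_error(arctan, M),
    }
    for M in (1, 10, 100)
}

for M, row in errors.items():
    print(M, row)
```

With these settings the Bridge error keeps growing with $M$, while the Laplace and arctan errors shrink, in line with the lemmas.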

Two other non-convex penalties

Here we mention two other non-convex penalties that have appeared in the regularization literature previously. The first one is the Geman-McClure function

$$p^{GM}_{\lambda,\gamma}(w) = \lambda\, \frac{|w|}{|w| + \gamma};$$

this function is exactly the same as the function $\tilde{p}_{\lambda,\varepsilon}$ mentioned above if we replace $\log(1/\varepsilon)$ with $1/\gamma$. As mentioned above, the function $\tilde{p}_{\lambda,\varepsilon}$ is related to the Laplace penalty via Lemma 2.1. For the same parameter $\varepsilon$, the non-convexity for $\tilde{p}_{\lambda,\varepsilon}$ is twice that for the Laplace penalty. It can also be shown that the derivative of the Laplace penalty converges to zero at a faster rate than that of $\tilde{p}_{\lambda,\varepsilon}$. Based on these considerations we did not study the Geman-McClure function in this paper.

Yet another non-convex penalty that has appeared in the literature is the concave logarithmic penalty

$$p^{\log}_{\lambda,\gamma}(w) = \lambda \log(\gamma |w| + 1).$$

This function increases with the absolute value of the argument like the $L_1$ and $L_2$ penalties; although the increase is at a lower rate than $L_1$ and $L_2$ for large $|w|$, it is still an increasing function, thereby resulting in bias. Hence we do not consider this latter penalty in this paper either.

3 Experimental results

We assess the performance of regularized DNNs with the non-convex penalty functions presented in this paper, by applying them on three real datasets (MNIST, FMNIST and RCV1). Details of the analysis and description of the three datasets are given below.

The optimal weights of the fitted deep neural networks (DNNs) were estimated by minimizing the total cross-entropy loss function. We used the batch gradient descent algorithm with early stopping. To avoid the vanishing/exploding gradients problem, the weights were initialized to values obtained from a normal distribution with mean zero and variance $2/n$, where $n$ is the number of neurons in the layer [16, 17]. The rectified linear unit (ReLU) function was used as the activation function.

The training data was randomly split into multiple batches. During each epoch, the gradient descent algorithm was sequentially applied to each of these batches resulting in new weights estimates. At the end of each epoch, the total validation loss was calculated using the validation set. When twenty consecutive epochs failed to improve the total validation loss, the iteration was stopped. The maximum number of epochs was set at 250. The weights estimate that resulted in the lowest total validation loss was selected as the final estimate. Since there was a random aspect to the way the training sets were split into batches, the whole process was repeated three times with seed values 1, 2, and 3. The reported test error rates are the median of the three test error rates obtained using each of these seed values.
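The stopping rule described above can be sketched as a small training loop. This is a minimal illustration, not the authors' TensorFlow code: `run_epoch` and `validation_loss` are hypothetical callbacks standing in for one pass of gradient descent over all batches and for the evaluation of the total validation loss, respectively. The patience of 20 epochs and the cap of 250 epochs are taken from the text.

```python
def train_with_early_stopping(run_epoch, validation_loss,
                              max_epochs=250, patience=20):
    """Train until `patience` consecutive epochs fail to improve validation loss.

    run_epoch():            hypothetical callback; runs one epoch over all
                            batches and returns the current weights.
    validation_loss(w):     hypothetical callback; returns the total
                            validation loss for weights w.
    Returns the weights with the lowest validation loss, that loss, and the
    number of epochs actually run.
    """
    best_loss = float("inf")
    best_weights = None
    epochs_without_improvement = 0
    epochs_run = 0

    for epoch in range(1, max_epochs + 1):
        epochs_run = epoch
        weights = run_epoch()
        loss = validation_loss(weights)
        if loss < best_loss:
            # New best estimate: remember it and reset the patience counter.
            best_loss, best_weights = loss, weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break

    return best_weights, best_loss, epochs_run
```

Repeating this procedure for seeds 1, 2 and 3 and taking the median of the three resulting test error rates would give the figures reported in the paper.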

A triangular learning rate schedule was used because it produced the lowest test error rates [18]. The learning rates varied from a minimum of 0.01 to a maximum of 0.25 (see figure 1 below).

Figure 1: Learning rate plot
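A triangular schedule of this kind can be written in a few lines. The sketch below is an illustration under assumptions: the text gives only the minimum (0.01) and maximum (0.25) learning rates, so the cycle length (`half_cycle_steps`) and the choice to repeat the triangle periodically are hypothetical and may differ from the schedule actually used.

```python
def triangular_learning_rate(step, half_cycle_steps, lr_min=0.01, lr_max=0.25):
    """Learning rate that ramps linearly lr_min -> lr_max -> lr_min.

    step:              current training step (0-based).
    half_cycle_steps:  hypothetical number of steps for one ramp up (or down).
    """
    position = step % (2 * half_cycle_steps)
    distance_from_peak = abs(position - half_cycle_steps)
    fraction = 1.0 - distance_from_peak / half_cycle_steps
    return lr_min + (lr_max - lr_min) * fraction


# Example: learning rates over one full cycle of 200 steps.
schedule = [triangular_learning_rate(s, half_cycle_steps=100) for s in range(201)]
```

The schedule starts at the minimum rate, peaks halfway through the cycle, and returns to the minimum at the end of the cycle.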

For all penalty functions the optimal $\lambda$ was found by fitting models with logarithmically equidistant $\lambda$ values in a grid. We used Python ver. 3.6.7rc2 and TensorFlow ver. 1.12.0 for the calculations.
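A logarithmically equidistant grid and the selection of the best value can be sketched as follows. The grid bounds, the number of grid points and the `error_for_lambda` callback are hypothetical, since the text does not state them; the callback stands in for fitting a model with a given $\lambda$ and returning its error rate.

```python
def log_spaced_grid(low_exponent, high_exponent, num_points):
    """Return num_points values 10**e with e equally spaced in [low, high]."""
    step = (high_exponent - low_exponent) / (num_points - 1)
    return [10.0 ** (low_exponent + i * step) for i in range(num_points)]


def select_lambda(lambdas, error_for_lambda):
    """Return the lambda in the grid with the smallest error.

    error_for_lambda(lam) is a hypothetical callback that fits a model with
    penalty weight lam and returns its error rate.
    """
    return min(lambdas, key=error_for_lambda)


# Example grid (bounds chosen for illustration only): 1e-6, 1e-5, ..., 1e-1.
lambda_grid = log_spaced_grid(-6, -1, 6)
```

Consecutive grid values differ by a constant factor rather than a constant difference, which is what makes the grid equidistant on the log scale.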

The models were fit with no regularization, with $L_1$ and $L_2$ regularization, and with the non-convex regularization methods. The results based on the new non-convex penalty functions were comparable to or better than $L_1$ and $L_2$ regularization in all the datasets. A general overview of the datasets and the DNN model specifications is given in Table 1. The models were intentionally overparameterized to better contrast the effects of the various types of regularization methods.

Table 1. Overview of datasets and DNN model specifications
Dataset | Domain | Dimensionality | Classes | DNN specifications | Training set | Validation set | Test set
MNIST | Visual | 784 (28 × 28 greyscale) | 10 | 5 layers, 1024 units | 48000 | 2000 | 10000
FMNIST | Visual | 784 (28 × 28 greyscale) | 10 | 5 layers, 1024 units | 45000 | 5000 | 10000
RCV1 (Reuters) | Text | 47236 | 50 | 5 layers, 512 units | 13355 | 2000 | 49565

For non-convex penalties there is an extra parameter that we have to reckon with: $a$ for SCAD, $b$ for MCP, $\varepsilon$ for Laplace and $\gamma$ for Arctan. In previous literature [6], $a = 3.7$ has been suggested as an optimal parameter for SCAD based on a Bayes risk criterion. Under different settings, $b = 1.5$, $5$ and $20$ have been considered as optimal values for the MCP parameter [25]. $a = 3.7$ for SCAD and $b = 1.5$, $5$ and $20$ for MCP are the parameter values that we considered for all data analyses in this paper. After trial-and-error runs (i.e. a grid search) with one dataset (MNIST), we settled on $\gamma = 1$ and $100$ for the Arctan penalty and $\varepsilon = 10^{-7}$ for the Laplace penalty.

MNIST:

The Modified National Institute of Standards and Technology (MNIST) dataset is a widely used toy dataset of 60,000 grey-scale images of hand-written digits. Each image has $28 \times 28$ pixels. The intensity measures of these 784 pixels form the input variables of the model. The dataset was split into a training set of 48,000 images, a validation set of 2,000 images, and a test set of 10,000 images.

Figure 2: Test error rates, from MNIST DNN analysis, corresponding to a grid of $\lambda$ values. Horizontal line in each panel corresponds to the error rate obtained without regularization. Left-most panel contains the error rates based on convex penalties; the middle and right panels contain error rates based on nonconvex penalties.

The test error rate obtained with no regularization was 1.87. With $L_1$ regularization, the test error reduced to 1.24, and with $L_2$ regularization the test error was even lower: 1.23. The test error rate of 1.23, which was the lowest among all the methods considered, was also obtained by the Laplace penalty. The Arctan method gave test error rates of 1.26 and 1.25, with $\gamma = 1$ and $100$ respectively, which were comparable to the error rates obtained with $L_1$ and $L_2$ regularization. The SCAD and MCP methods improved the error rates compared with no regularization, but the results were not as good as those obtained with the other penalty functions. An interesting feature of the results based on the Laplace and Arctan penalties is their large fluctuation with $\lambda$. Although Laplace gave the best rate (1.23, same as $L_2$) for a particular $\lambda$, there was another $\lambda$ value for which the test error was near 6. This phenomenon was observed for error rates based on the Laplace and Arctan penalties in the other two data analyses as well.

FMNIST:

The Fashion MNIST dataset consists of 60,000 images of various types of clothing such as shirts, pants, and caps. There are 10 classes in total. The images have $28 \times 28$ pixels whose intensity measures were used as the input variables of the model. The 60,000 images were split into a training set of 45,000 images, a validation set of 5,000 images, and a test set of 10,000 images. FMNIST has characteristics very similar to those of MNIST, except that the FMNIST test error rates tend to be much higher than the MNIST test error rates.

Figure 3: Test error rates, from FMNIST DNN analysis, corresponding to a grid of $\lambda$ values. Horizontal line in each panel corresponds to the error rate obtained without regularization. Left-most panel contains the error rates based on convex penalties; the middle and right panels contain error rates based on nonconvex penalties.

The test error rate obtained with no regularization was 11.94. With $L_2$ regularization, the test error reduced to 10.15, and with $L_1$ the test error rate was 10.06. SCAD and MCP improved upon the error rate obtained without any regularization, but their error rates were not lower than those obtained with $L_1$ or $L_2$. The new non-convex penalties presented in this paper achieved lower rates than $L_1$ and $L_2$, with Arctan ($\gamma$ = 1) giving the lowest error rate: Laplace, 9.98; Arctan ($\gamma$ = 1), 9.87; and Arctan ($\gamma$ = 100), 9.93.

RCV1:

Reuters Corpus Volume I (RCV1) is a collection of 804,414 newswire articles labelled as belonging to one or more of 103 news categories [19]. In our analysis, only single-labelled data points were used and all the multi-labelled data points were excluded resulting in a dataset consisting of news wire articles from 50 news categories. The cosine-normalized, log TF-IDF values of 47,236 words appearing in these news articles were used as the input variables for the model. The training set consisted of 13,355 data points; the validation set consisted of 2000 data points; and the test set consisted of 49,565 data points.

Figure 4: Test error rates, from RCV1 (Reuters) DNN analysis, corresponding to a grid of $\lambda$ values. Horizontal line in each panel corresponds to the error rate obtained without regularization. Left-most panel contains the error rates based on convex penalties; the middle and right panels contain error rates based on nonconvex penalties.

The test error rate obtained with no regularization was 14.66. With Lasso regularization, the test error reduced to 12.97. The Laplace method gave the lowest test error of 12.94. The test error rates with all other methods were below 14.66 but above 13. MCP with $b = 1.5$ fared better than the Arctan penalty in this data analysis.

All the results from all data analyses are summarized in Table 2 below. Detailed results based on all seed values are given in the tables in Appendix B.

Table 2. Median test error rates at optimal $\lambda$ (* marks the lowest error rate for each dataset)
Penalty function | MNIST | FMNIST | RCV1
None | 1.87 | 11.94 | 14.66
$L_1$ (Lasso) | 1.24 | 10.06 | 12.97
$L_2$ (Ridge) | 1.23* | 10.15 | 13.77
SCAD (a = 3.7) | 1.80 | 11.45 | 13.96
MCP (b = 1.5) | 1.60 | 11.39 | 13.33
MCP (b = 5) | 1.67 | 11.39 | 14.44
MCP (b = 20) | 1.65 | 11.35 | 14.36
Laplace ($\varepsilon = 10^{-7}$) | 1.23* | 9.98 | 12.94*
Arctan ($\gamma$ = 1) | 1.26 | 9.87* | 13.41
Arctan ($\gamma$ = 100) | 1.25 | 9.93 | 13.81

4 Discussion

Non-convex regularizers were originally considered in the statistical literature after observing certain limitations of the convex regularizers from the class of Bridge functions. Yet non-convex regularizers never gained as much popularity as their convex counterparts in DNN applications, perhaps because of certain perceived computational and optimization limitations - that is, in the presence of local optima which are not global optima, as in the case of non-convex functions, iterative methods such as gradient or coordinate descent may terminate undesirably at a local optimum. However, recent theoretical work [8,9] that established regularity conditions under which both local and global minima lie within a small neighborhood of the true minimum has brought the limelight back onto non-convex regularizers. The new theory eliminates the need for specially designed optimization algorithms for most non-convex regularizers, as it implies that standard first-order optimization methods will converge to points within statistical error of the truth. In other words, non-convex regularizers that satisfy such regularity conditions enjoy guarantees for both statistical accuracy and optimization efficiency.

Penalty functions typically considered for regularization of DNNs are convex. In this paper, we present non-convex penalty functions (Laplace, Arctan, SCAD and MCP) that are typically not considered in the DNN literature. The Arctan penalty function has not been considered in any statistical literature previously, to the best of our knowledge. We studied the performance of the non-convex penalty functions while applying DNNs on three large datasets (MNIST, FMNIST and RCV1). Test error rates for the Laplace and Arctan penalty functions were comparable to or better than those obtained by the convex penalties.

Another popular and very efficient approach for DNN regularization is Dropout [22], in which units along with their connections are randomly dropped. One way to understand the good performance of Dropout is within the statistical framework of model averaging. A more recently proposed method for DNN regularization is Shakeout [23], in which a unit’s contribution to the next layer is randomly enhanced or reversed. By choosing the enhance factor to be 1 and the reverse factor to be 0, Dropout can be considered as a special case of Shakeout. When the data is scarce, it has been observed that Shakeout outperforms Dropout [23]. Although the operating characteristics and implementation of Dropout and Shakeout are different from the methods mentioned in this paper, there are certain connections among the underlying paradigms. By marginalizing the noise, equivalency between Dropout and adaptive $L_2$ regularization has been established in the literature [22,24]. Theoretically, the Shakeout regularizer can be seen as a linear combination of $L_0$, $L_1$ and $L_2$ penalty functions [23]. Plotting the regularization effect as a function of a single weight (e.g. see figure 2 in [23]) makes the comparison even clearer. Dropout and Shakeout are similar to non-convex regularizers except in a neighborhood near zero. Near zero, the Dropout regularizer is similar to the $L_2$ penalty function (i.e. quadratic), while the Shakeout regularizer is sharp and discontinuous.

Non-convex penalties introduce yet another parameter into the objective function. Ideally a grid of values in the domain of this extra parameter should be considered. However, this could lead to a substantial increase in computational time. A cross-validation strategy that may work for small-scale datasets becomes computationally intensive in DNN applications. Based on a Bayes risk criterion and simulations, $a = 3.7$ was suggested as an optimal value for the extra parameter in SCAD [6]. For the MCP parameter $b$, a value of 1.5 was found optimal in one setting, while values 5 and 20 were optimal (compared to Lasso) in another setting in [25]. The method based on a path of solutions suggested in [25] could be considered for determining an optimal value of the extra parameter (say $\gamma$) in any non-convex penalty. In this method, first an interval for $\lambda$ is determined in which the objective function is “locally convex”. For a given $\gamma$, if the chosen solution lies in this region, then $\gamma$ can be increased to make the penalty more convex, and if it lies outside the region one can lower $\gamma$ without fear of non-convexity. This process is iterated a few times to obtain a $\gamma$ that strikes a balance between parsimony and convexity (of the objective function). Other parameter selection methods within the deep learning literature, such as grid search, random search [26], Bayesian optimization methods [27] and gradient-based optimization [28], could also be used. For the DNN data analyses conducted in this paper, we considered only one parameter value ($\varepsilon$ = 1e-07) for the Laplace penalty and two parameter values ($\gamma$ = 1 and 100) for the Arctan penalty. The performance of the regularization methods, even with this limited range of values for the corresponding parameters, was very good for the datasets that we considered.

References

  • (1) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
  • (2) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press, Cambridge, MA.
  • (3) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267-288.
  • (4) Hoerl, A.E. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.
  • (5) Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109–148.
  • (6) J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.
  • (7) C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010.
  • (8) P. Loh and M.J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research 16 (2015) 559–616.
  • (9) P. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. Annals of Statistics 45(6): 2455-2482, 2017.
  • (10) Fu, W. J. (1998). Penalized regressions: the Bridge versus the Lasso. J. Comput. Graph. Statist. 7, 397–416.
  • (11) Knight K and Fu W (2000). Asymptotics for lasso-type estimators Ann. Statist. 28, no. 5, 1356-1378.
  • (12) Trzasko, J and Manduca, A. Highly undersampled magnetic resonance image reconstruction via homotopic L0-minimization. IEEE Transactions on Medical Imaging. Vol 28, Issue: 1, Jan. 2009
  • (13) Lu, C., Tang, J., Yan, S. and Lin, Z. 2014. Generalized Nonconvex Nonsmooth Low-Rank Minimization. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’14). IEEE Computer Society, Washington, DC, USA, 4130-4137.
  • (14) Geron, A. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media; 1 edition (2017)
  • (15) Geman, D. and Yang, C. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, Volume 4, Issue 7, Jul 1995.
  • (16) Glorot, X., Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. PMLR 9:249-256, 2010.
  • (17) He, K., Zhang, X., Ren, S., Sun, J. (2015) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv eprint 1502.01852
  • (18) Smith, L.N. (2017) Cyclical Learning Rates for Training Neural Networks. arXiv eprint: 1506.01186v6
  • (19) Lewis, D. D., Yang, Y., Rose, T. G., Li, F. (2004). RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5, 361-397.
  • (20) Tarigan, B. and van de Geer, S. (2006). Classifiers of support vector machine type with l1 complexity regularization. Bernoulli, 12, 1045–1076.
  • (21) Meier, L., van de Geer, S., Buehlmann, P (2008). The group Lasso for logistic regression. JRSS, Series B, 70, 53-71.
  • (22) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929-1958.
  • (23) Kang, G., Li, J., Tao, D. (2017). Shakeout: A new approach to regularized deep neural network training. IEEE Trans. Pattern Anal. Mach. Int., vol. 40, no. 5, pp. 1245-1258.
  • (24) Wager, S., Wang, S., and Liang, P.S. (2013). Dropout training as adaptive regularization. Advances in Neural Information Processing Systems, pp. 351-359.
  • (25) Breheny, P., and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1), 232-253.
  • (26) Bergstra, J., and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, pp. 281-305.
  • (27) Snoek, J., Larochelle, H and Adams, R.P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information processing systems, pp. 2951-2959.
  • (28) Maclaurin, D., Duvenaud, D., and Adams, R.P. (2015). Gradient-based hyperparameter optimization through reversible learning. Proceedings of the 32nd international conference of machine learning.

Appendix A: Statistical Consistency

Statistical consistency results for the weight estimates based on the penalty functions in eq. (3) and eq. (4) can be obtained by slightly modifying existing theoretical results [20,21] in the literature. Asymptotic results for SCAD and MCP are presented in [6] and [7], so we focus only on the Laplace and arctan penalties. Consider the class of logistic classifiers,

Classification is done based on the sign of the function defined as

Here

denotes the class probability. If there are classes, then the class probability for the class may be modeled as

but for simplicity, we just focus on binary classification.

We assume that is endowed with a probability measure and let be the norm (). Denote . Design matrix consists of copies of . The empirical logistic (also known as cross-entropy) loss is

and theoretical loss

Let

We assume the following three conditions given in [20].

(C1): There exist constants and , such that for all ,

(C2): The smallest eigenvalue of is non-zero.

(C3): for some . Here denotes the unit vector with 1 as the element and 0’s elsewhere.

The following theorem holds for , where equals either the Laplace penalty function given in eq (3) or the arctan penalty function given in eq. (4).

Theorem 4.1.

Assume conditions C1 to C3 hold and that . Then for universal constants ,

where

(9)

The constant in eq. (9) depends on the penalty function: for the Laplace penalty and for the arctan penalty.

Proof.

We give only a sketch of the proof, since it is the same as the lengthy proof given in [20], with only minor differences. First, note that the only difference between the statement of the above theorem and the statement of Theorem 1 in [20] is the constant in eq. (9).

Although the loss function used in [20] was the hinge loss, the steps in their proof also carry over to the logistic loss (in fact, they become easier), as pointed out in [21]. Thus, the only steps in the proof that we need to focus on are those involving the penalty function. Their theorem is stated for the Lasso penalty, and the triangle inequality satisfied by the Lasso penalty is used in certain steps of the proof. Since both the Laplace and arctan penalties are subadditive (each is concave on the positive real line and takes the value zero at the origin), those steps hold for these two penalties as well.
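For completeness, the standard argument behind the subadditivity claim can be written out as follows. This is a sketch in generic notation: the function f below is a placeholder for a penalty viewed as a function of the absolute value of a weight, not notation taken from the paper, and the last step additionally assumes that f is nondecreasing, as is the case for bounded increasing penalties of the Laplace or arctan type.

```latex
% Assumption: f : [0,\infty) \to [0,\infty) is concave with f(0) = 0.
For $a, b \ge 0$ with $a + b > 0$, concavity gives
\begin{align*}
f(a) &= f\!\left(\tfrac{a}{a+b}\,(a+b) + \tfrac{b}{a+b}\cdot 0\right)
      \ge \tfrac{a}{a+b}\, f(a+b) + \tfrac{b}{a+b}\, f(0)
      = \tfrac{a}{a+b}\, f(a+b), \\
f(b) &\ge \tfrac{b}{a+b}\, f(a+b).
\end{align*}
Adding the two lines yields $f(a) + f(b) \ge f(a+b)$.
% Extra assumption for the final step: f is nondecreasing.
Hence, for real $u, v$,
\[
f(|u+v|) \le f(|u| + |v|) \le f(|u|) + f(|v|),
\]
which plays the role of the triangle inequality used for the Lasso penalty.
```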

The only other step we need to focus on is Lemma 5.2 in their proof, where the key inequality used is

(10)

Instead, we use the inequalities

(11)

and

(12)

Inequality in (11) follows from the inequality in (10) and the fact that

which follows easily by considering the function and noting that and

Inequality in (12) follows from the inequality in (10) and the right inequality in Lemma 2.2.

Appendix B: Detailed Tables (Supplementary Material)

DNN analysis was repeated for multiple seed values. The test error rate presented in section 3.3 was the median of the test error rates from all seed values. Detailed results (i.e. test error rates for each grid point and seed value) used for compiling the summarized table in section 3.3 are presented below.
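The summary step described above can be sketched as follows. The error rates are copied from four rows of the MNIST Lasso table in this appendix; the names `errors_by_grid_point` and `summarize` are hypothetical, and picking the grid point with the lowest median is an assumption that is consistent with the highlighted entries in the tables rather than a procedure spelled out in the text.

```python
from statistics import median

# Test error rates (%) for three seeds at four grid points,
# copied from the MNIST Lasso table in this appendix.
errors_by_grid_point = {
    -4.80: [1.28, 1.40, 1.29],
    -5.00: [1.44, 1.27, 1.24],
    -5.20: [1.16, 1.34, 1.24],
    -5.40: [1.45, 1.36, 1.35],
}


def summarize(errors):
    """Return the per-grid-point medians and the grid point with the lowest median."""
    medians = {grid_point: median(values) for grid_point, values in errors.items()}
    best_grid_point = min(medians, key=medians.get)
    return medians, best_grid_point


medians, best_grid_point = summarize(errors_by_grid_point)
print(best_grid_point, medians[best_grid_point])
```

Using the median rather than the mean keeps a single poorly converged seed from dominating the summary, which matters when only three seeds are run per grid point.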

4.1 MNIST

MNIST results with no regularization
Seed 1 2 3 4 5 6 7 8 9 10 Median
Error 1.69 2.54 1.76 2.38 1.70 1.98 2.34 1.72 2.56 1.76 1.87
MNIST results with Lasso regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 1.98 1.98 1.69 1.98
-4.20 1.55 1.55 1.52 1.55
-4.40 1.45 1.36 1.32 1.36
-4.60 1.37 1.34 1.41 1.37
-4.80 1.28 1.40 1.29 1.29
-5.00 1.44 1.27 1.24 1.27
-5.20 1.16 1.34 1.24 1.24 (lowest median)
-5.40 1.45 1.36 1.35 1.36
-5.60 1.36 1.42 1.34 1.36
-5.80 1.24 1.32 1.21 1.24
-6.00 1.26 1.31 1.24 1.26
-6.20 1.72 2.09 1.87 1.87
-6.40 2.16 1.78 2.31 2.16
-6.60 2.14 1.97 1.41 1.97
-6.80 1.97 1.84 2.45 1.97
-7.00 1.84 1.97 2.03 1.97
MNIST results with L2 regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 1.26 1.23 1.22 1.23 (lowest median)
-4.20 1.36 1.25 1.22 1.25
-4.40 1.45 1.33 1.29 1.33
-4.60 1.31 1.28 1.34 1.31
-4.80 1.33 1.35 1.26 1.33
-5.00 1.90 1.30 1.27 1.30
-5.20 1.70 2.45 2.41 2.41
-5.40 2.21 2.16 2.43 2.21
-5.60 2.09 2.23 2.02 2.09
-5.80 2.43 1.71 1.92 1.92
-6.00 2.14 2.07 2.15 2.14
-6.20 1.99 1.96 2.38 1.99
-6.40 1.83 2.05 2.03 2.03
-6.60 2.02 1.38 1.47 1.47
-6.80 1.88 2.40 1.83 1.88
-7.00 2.19 1.40 1.97 1.97
MNIST results with Laplace (parameter = 1e-07) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 6.05 5.78 6.43 6.05
-4.20 3.07 2.70 2.93 2.93
-4.40 2.29 2.37 2.41 2.37
-4.60 2.33 2.10 1.93 2.10
-4.80 2.04 1.77 1.94 1.94
-5.00 1.81 1.86 1.79 1.81
-5.20 1.46 1.53 1.39 1.46
-5.40 1.35 1.43 1.45 1.43
-5.60 1.18 1.36 1.23 1.23 (lowest median)
-5.80 1.47 1.42 1.45 1.45
-6.00 1.46 1.46 1.35 1.46
-6.20 1.45 1.40 1.41 1.41
-6.40 1.33 1.35 1.35 1.35
-6.60 1.33 1.36 1.30 1.33
-6.80 1.25 1.28 1.34 1.28
-7.00 1.19 1.40 1.33 1.33

MNIST results with Arctan (parameter = 1) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 1.70 1.57 1.55 1.57
-4.20 1.35 1.41 1.22 1.35
-4.40 1.33 1.41 1.33 1.33
-4.60 1.26 1.31 1.16 1.26 (lowest median)
-4.80 1.25 1.33 1.38 1.33
-5.00 1.35 1.35 1.24 1.35
-5.20 1.34 1.38 1.32 1.34
-5.40 1.24 1.42 1.29 1.29
-5.60 1.35 1.34 1.22 1.34
-5.80 1.45 1.40 1.31 1.40
-6.00 1.82 1.80 2.22 1.82
-6.20 2.12 1.75 1.69 1.75
-6.40 2.16 2.05 1.61 2.05
-6.60 2.29 1.74 2.05 2.05
-6.80 1.85 1.97 1.88 1.88
-7.00 2.01 1.79 1.80 1.80
MNIST results with Arctan (parameter = 100) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 3.43 3.65 5.71 3.65
-4.20 1.95 3.62 1.75 1.95
-4.40 2.57 1.66 2.45 2.45
-4.60 1.94 2.64 1.36 1.94
-4.80 1.51 1.43 1.49 1.49
-5.00 1.47 1.50 2.11 1.50
-5.20 1.41 1.45 2.02 1.45
-5.40 1.51 1.51 1.50 1.51
-5.60 1.44 1.39 1.57 1.44
-5.80 1.88 1.74 1.57 1.74
-6.00 1.30 1.34 1.31 1.31
-6.20 1.35 1.31 1.31 1.31
-6.40 1.38 1.30 1.29 1.30
-6.60 1.33 1.25 1.25 1.25 (lowest median)
-6.80 1.39 1.30 1.27 1.30
-7.00 1.34 1.27 1.34 1.34
MNIST results with SCAD (parameter = 3.7) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 1.95 2.29 1.92 1.95
-4.20 2.07 2.05 1.78 2.05
-4.40 1.94 2.29 1.92 1.94
-4.60 2.08 1.99 1.67 1.99
-4.80 1.70 1.80 1.95 1.80 (lowest median)
-5.00 2.34 2.26 2.08 2.26
-5.20 2.28 2.11 1.94 2.11
-5.40 2.29 2.01 2.10 2.10
-5.60 2.28 2.39 1.83 2.28
-5.80 2.02 1.88 1.58 1.88
-6.00 2.32 2.06 1.94 2.06
-6.20 2.12 1.74 1.86 1.86
-6.40 2.20 1.97 1.94 1.97
-6.60 1.89 1.61 2.12 1.89
-6.80 2.00 1.44 1.96 1.96
-7.00 2.07 2.08 2.00 2.07
MNIST results with MCP (parameter = 1.5) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 1.72 1.97 2.09 1.97
-4.20 2.19 1.89 1.84 1.89
-4.40 2.16 1.88 1.91 1.91
-4.60 1.94 1.92 1.85 1.92
-4.80 1.82 2.38 1.95 1.95
-5.00 2.10 1.93 2.34 2.10
-5.20 1.76 2.28 1.80 1.80
-5.40 2.09 1.33 1.60 1.60 (lowest median)
-5.60 1.80 2.12 1.90 1.90
-5.80 2.01 1.83 2.12 2.01
-6.00 2.24 2.20 1.65 2.20
-6.20 2.16 1.46 1.74 1.74
-6.40 2.00 1.67 2.15 2.00
-6.60 2.02 1.78 1.83 1.83
-6.80 2.02 1.96 1.89 1.96
-7.00 2.02 2.08 2.56 2.08
MNIST results with MCP (parameter = 5) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 2.05 1.65 1.98 1.98
-4.20 1.96 2.40 2.08 2.08
-4.40 2.03 1.75 2.11 2.03
-4.60 2.16 2.10 1.86 2.10
-4.80 1.82 1.96 2.14 1.96
-5.00 2.01 2.05 2.03 2.03
-5.20 2.01 2.24 2.09 2.09
-5.40 2.19 2.12 1.96 2.12
-5.60 2.17 1.94 2.02 2.02
-5.80 2.29 1.67 1.64 1.67 (lowest median)
-6.00 2.13 2.15 2.40 2.15
-6.20 1.97 2.00 1.80 1.97
-6.40 1.71 1.76 2.09 1.76
-6.60 2.03 2.13 2.10 2.10
-6.80 2.01 2.02 1.78 2.01
-7.00 2.02 2.08 2.00 2.02
MNIST results with MCP (parameter = 20) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 2.37 2.01 1.95 2.01
-4.20 2.06 2.38 2.07 2.07
-4.40 2.11 1.64 1.65 1.65 (lowest median)
-4.60 1.95 1.64 1.98 1.95
-4.80 1.65 1.94 2.07 1.94
-5.00 1.65 2.22 1.91 1.91
-5.20 2.01 1.74 2.00 2.00
-5.40 2.25 2.72 2.06 2.25
-5.60 1.80 2.39 1.72 1.80
-5.80 2.31 2.04 1.88 2.04
-6.00 1.91 2.13 1.89 1.91
-6.20 1.81 2.10 1.82 1.82
-6.40 2.28 2.40 2.02 2.28
-6.60 1.94 2.36 1.87 1.94
-6.80 1.80 2.18 2.52 2.18
-7.00 2.02 1.67 2.37 2.02

4.2 FMNIST

FMNIST results with no regularization
Seed 1 2 3 4 5 6 7 8 9 10 Median
Error 11.91 11.75 11.70 12.80 11.97 12.47 11.91 12.28 12.38 11.40 11.94
FMNIST results with Lasso regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 12.23 12.55 14.30 12.55
-4.20 11.90 11.44 11.75 11.75
-4.40 14.09 10.79 10.92 10.92
-4.60 11.45 10.55 13.68 11.45
-4.80 10.63 13.42 10.58 10.63
-5.00 10.06 10.65 9.97 10.06 (lowest median)
-5.20 11.04 10.86 11.15 11.04
-5.40 10.53 10.07 9.91 10.07
-5.60 10.46 10.29 10.34 10.34
-5.80 11.09 11.66 11.50 11.50
-6.00 11.97 11.48 11.71 11.71
-6.20 11.88 11.81 11.70 11.81
-6.40 11.57 11.78 11.71 11.71
-6.60 12.00 11.79 11.81 11.81
-6.80 12.10 11.81 11.35 11.81
-7.00 12.04 11.59 11.76 11.76
FMNIST results with L2 regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 10.59 10.32 11.45 10.59
-4.20 9.95 10.27 10.76 10.27
-4.40 10.15 10.20 10.15 10.15 (lowest median)
-4.60 10.45 10.65 10.15 10.45
-4.80 12.00 12.13 11.73 12.00
-5.00 11.21 12.44 11.77 11.77
-5.20 11.66 11.75 11.53 11.66
-5.40 12.20 11.56 11.65 11.65
-5.60 12.01 12.47 12.05 12.05
-5.80 12.18 11.49 12.44 12.18
-6.00 12.05 11.40 12.22 12.05
-6.20 12.25 11.54 11.54 11.54
-6.40 12.17 12.32 11.77 12.17
-6.60 11.36 11.78 11.62 11.62
-6.80 11.93 11.66 11.38 11.66
-7.00 12.13 12.44 11.61 12.13
FMNIST results with Laplace (parameter = 1e-07) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 18.37 19.90 19.70 19.70
-4.20 14.39 14.98 16.91 14.98
-4.40 16.89 16.78 14.02 16.78
-4.60 15.51 15.07 13.01 15.07
-4.80 12.51 12.48 12.29 12.48
-5.00 11.82 11.83 11.93 11.83
-5.20 13.88 11.22 11.05 11.22
-5.40 13.22 10.25 10.53 10.53
-5.60 10.54 12.81 10.19 10.54
-5.80 12.52 9.99 12.16 12.16
-6.00 12.04 10.21 10.92 10.92
-6.20 9.68 9.98 10.52 9.98 (lowest median)
-6.40 11.23 10.11 10.01 10.11
-6.60 10.06 10.00 10.14 10.06
-6.80 11.49 10.24 10.87 10.87
-7.00 10.69 11.43 10.88 10.88

FMNIST results with Arctan (parameter = 1) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 14.07 11.60 11.74 11.74
-4.20 11.44 10.98 11.01 11.01
-4.40 10.94 10.35 11.19 10.94
-4.60 10.62 9.88 10.58 10.58
-4.80 9.84 9.87 10.24 9.87 (lowest median)
-5.00 10.47 10.21 10.32 10.32
-5.20 9.95 9.91 11.14 9.95
-5.40 10.51 10.29 9.82 10.29
-5.60 11.08 10.44 12.55 11.08
-5.80 11.50 11.66 11.71 11.66
-6.00 12.22 13.05 11.43 12.22
-6.20 12.09 11.69 11.93 11.93
-6.40 12.09 12.67 11.27 12.09
-6.60 11.15 11.39 11.53 11.39
-6.80 12.20 12.88 11.22 12.20
-7.00 11.46 11.47 11.18 11.46
FMNIST results with Arctan (parameter = 100) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 15.54 15.84 18.20 15.84
-4.20 15.03 16.42 14.62 15.03
-4.40 14.08 16.67 13.38 14.08
-4.60 14.53 15.96 13.75 14.53
-4.80 17.16 15.39 12.85 15.39
-5.00 11.77 14.65 13.92 13.92
-5.20 11.56 12.39 13.84 12.39
-5.40 9.68 14.27 13.27 13.27
-5.60 9.69 12.23 11.65 11.65
-5.80 9.77 11.00 11.90 11.00
-6.00 9.80 10.13 10.05 10.05
-6.20 12.05 9.73 10.27 10.27
-6.40 10.30 10.33 10.21 10.30
-6.60 9.93 10.61 9.93 9.93 (lowest median)
-6.80 11.03 11.10 12.74 11.10
-7.00 11.47 11.46 11.00 11.46
FMNIST results with SCAD (parameter = 3.7) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 12.12 12.55 11.48 12.12
-4.20 11.98 12.54 11.75 11.98
-4.40 11.96 11.78 11.28 11.78
-4.60 11.83 11.31 11.85 11.83
-4.80 12.14 11.74 11.76 11.76
-5.00 12.15 11.79 11.65 11.79
-5.20 11.56 12.53 11.78 11.78
-5.40 11.90 11.37 11.82 11.82
-5.60 11.83 12.75 11.79 11.83
-5.80 11.72 11.49 11.50 11.50
-6.00 11.92 12.59 11.86 11.92
-6.20 11.80 11.68 11.52 11.68
-6.40 12.18 13.08 12.28 12.28
-6.60 12.19 11.74 11.60 11.74
-6.80 12.06 11.68 11.78 11.78
-7.00 11.07 11.87 11.45 11.45 (lowest median)
FMNIST results with MCP (parameter = 1.5) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 12.15 11.62 11.17 11.62
-4.20 11.99 12.77 11.38 11.99
-4.40 11.52 11.39 11.95 11.52
-4.60 11.66 11.34 12.15 11.66
-4.80 11.05 11.67 11.81 11.67
-5.00 11.39 12.29 11.49 11.49
-5.20 12.25 11.41 11.59 11.59
-5.40 12.24 11.82 11.99 11.99
-5.60 12.06 11.76 12.00 12.00
-5.80 11.44 11.77 11.50 11.50
-6.00 11.18 12.79 11.80 11.80
-6.20 11.44 11.68 12.27 11.68
-6.40 12.51 12.95 11.90 12.51
-6.60 12.03 11.84 11.94 11.94
-6.80 11.95 11.68 12.14 11.95
-7.00 10.96 11.39 11.45 11.39 (lowest median)
FMNIST results with MCP (parameter = 5) regularization
Grid point Seed 1 Seed 2 Seed 3 Median
-4.00 12.19 12.78 11.38 12.19
-4.20 11.19 12.67 11.60 11.60
-4.40 12.02 11.72 12.33 12.02
-4.60 12.03 11.41 11.37 11.41
-4.80 11.79 11.64 11.74 11.74
-5.00 11.80 11.75 11.82 11.80
-5.20 12.10 12.35 12.02 12.10
-5.40 11.77 13.21 12.18 12.18
-5.60 12.34 12.86 10.67 12.34
-5.80 12.12 11.65 11.95 11.95
-6.00 11.24 11.39 11.40 11.39
-6.20 12.07 12.42 11.81 12.07
-6.40 12.28 12.87 11.78 12.28
-6.60 11.32 11.57 12.18 11.57
-6.80 11.93 11.34 11.85 11.85
-7.00 11.07 11.39 11.45 11.39 (lowest median)
FMNIST results with MCP (parameter = 20) regularization